Methods and Systems for Profiling Biological Systems

This application claims priority to and the benefit of U.S. Provisional Patent Application 
Serial No. 60/496,657, filed on August 20, 2003, the entire disclosure of which is incorporated 
by reference herein. 

Field of the Invention 

The invention relates to the field of data processing and evaluation. More particularly,
the invention relates to methods and systems for profiling a state of a biological system, e.g., a
mammal such as a human.

Background 

Current approaches to understanding biology, such as genomics and proteomics, typically
focus on a single aspect of a biological system at any one time. The "omics" technology
revolution, particularly that of genomics, has provided a basis for studies of a single type of
biomolecule both in single cell organisms, e.g., yeast, and in simple, multi-cellular systems, such
as sea urchin embryos. In both types of studies, the systems are perturbed by environmental
changes and/or genetic manipulation to enable the correlation of gene expression changes in a
number of different scenarios. Construction of in silico interaction networks is facilitated by
looking at interdependencies between and among genes from several different perspectives.
However, while modern quantitative genomic technologies are readily available, the resulting
information may be of low precision and utility. For example, in one sea urchin study, a
perturbation was deemed significant only if it gave rise to a three-fold or greater change in gene
expression. Although a number of experimental factors might contribute to the net variability in
a system and reduce precision, a significant biological effect may be manifested by a change that
occurs well under a three-fold cut-off.

Analyzing and understanding a complex, multi-cellular organism, such as a mammal, is
much more complicated. When studying the state of a complex biological system, one must take
into account the multi-compartmental character of the system, not to mention the variety of cell
and tissue types that will have unique gene expression and protein and metabolite levels.
Current studies that rely on the analysis of a single aspect of a biological system, e.g., a single
type of molecule or target, usually are not robust enough to understand the entire biological
system or subsystem that may be involved in a particular molecular pathway or disease.

An important challenge in the understanding of a biological system of a mammal and the
development of new drugs for complex, multi-factorial diseases is the identification and
validation of biomarkers/surrogate markers. Moreover, it appears that instead of single
biomarkers being indicative of a state of a biological system, biomarker patterns or biomarker
sets may be necessary to characterize and diagnose homeostasis or disease states for a biological
system, where multiple levels of the biological system are simultaneously considered in the
analysis. Accordingly, there is a need for methods and systems that consider a biological system
as a whole and that are able to advance the study of human disease and the discovery and
development of pharmaceutical products.

Summary of the Invention

The applicants of this patent application are pioneers in a field known as "systems
biology." In contrast to analysis of an individual aspect of a biological system, systems biology
is the study of biology as an integrated biological system including genetic, protein and
metabolic components, and their pathways, which are in flux and interdependent. Rather than
artificially simplifying the inherent complexity of biological processes that underlie the biology
of a complex organism, e.g., the biological processes involved in [...]
drug responses, the methods and systems described herein embrace the complexities and
interdependences contained within a biological system. By appropriately visualizing and
considering the complexity of a biological system, a skilled artisan can undertake biological
research at the systems level, developing a profile for a state of a biological system which
provides insight into the biological system as a whole.

The application describes methods and systems to analyze complex clinical samples of
mammals including humans at a biological systems level to provide new information about the
state of a biological system that was previously unobtainable through traditional chemistries or
genomics alone. Using the methods and systems described herein, it is possible to gain insight
into biological pathways and mechanisms of disease and drug response. More specifically, the
methods and systems can analyze and integrate data at the biomolecular component type level,
i.e., the gene/gene transcript, protein and metabolite level, to create knowledge that advances
pharmaceutical research and development by providing new insights into the molecular
mechanisms of health and disease, which further the development and discovery of novel
therapeutics to treat human disease.
To develop a profile of a state of a biological system, e.g., a disease state, multiple
measurements on complex biological samples are performed. Subsequently, comprehensive
gene, gene transcript, protein, and/or metabolite profiling coupled with correlation analysis and
network modeling provides insight into a biological system at a systems level so that
connections, correlations, and relationships among thousands of diverse, measurable molecular
components can be achieved. Such knowledge then may be used directly for the development of
therapeutic agents or biomarkers, may be used in combination with clinical information, and/or
may serve as a basis for directed, hypothesis-driven experiments designed to further elucidate
pathophysiologic mechanisms. Further, [...]
improve many aspects of pharmaceutical discovery and development, including drug safety and
efficacy, drug response, and the etiology of disease.

The application addresses limitations in current profiling techniques by providing a
method and system, or a "technology platform," having the ability to integrate a plurality of data
sets, which may include two or more biomolecular component types, to elucidate information
conveying associations between or among components or networks of interactions among
components. The methods and systems utilize statistical analyses of a plurality of data sets, e.g.,
spectrometric data, to develop a profile of a state of a biological system, e.g., a mammal such as
a human. The data sets comprise multiple measurements of the biological system and are
derived from three primary sources: a biological sample type, a measurement technique, and a
biomolecular component type. The application further describes a technology platform that
facilitates the discernment of similarities, differences, and/or correlations not only within a single
biomolecular component type within a sample or biological system, but also across two or more
biomolecular component types.

In a broad aspect, a method of profiling a state of a biological system includes evaluating
with statistical analysis a plurality of data sets of a biological system and comparing features
among the plurality of data sets to determine one or more sets of differences among at least a
portion of the plurality of data sets. The action of comparing the features among the plurality of
data sets can include direct comparison of one feature in a first data set to a corresponding
feature in another data set. The action of comparing the features also can include correlating or
associating features between or among data sets such as correlations associated with and/or
resulting from the statistical analysis, e.g., multivariate analysis. Based on the results of the
evaluation and comparison, a profile for a state of the biological system can be developed.
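
The two kinds of comparison described above, direct comparison of corresponding features and correlation of features between data sets, can be sketched as follows. This is a minimal illustration only, not the applicants' implementation: the function names, the dictionary representation of a data set, and all numeric values are assumptions introduced for the example, and the Pearson coefficient is just one possible choice of correlation measure.

```python
from math import sqrt


def feature_differences(first, second):
    """Direct comparison: difference of each feature present in both data sets."""
    shared = sorted(set(first) & set(second))
    return {name: second[name] - first[name] for name in shared}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / sqrt(var_x * var_y)


# Hypothetical peak intensities for a control and a treated sample.
control = {"peak_a": 10.0, "peak_b": 4.0, "peak_c": 7.0}
treated = {"peak_a": 12.5, "peak_b": 4.0, "peak_d": 1.0}
differences = feature_differences(control, treated)

# Hypothetical levels of one transcript and one protein across five samples,
# i.e., a correlation across two biomolecular component types.
transcript = [1.0, 2.0, 3.0, 4.0, 5.0]
protein = [2.0, 4.0, 6.0, 8.0, 10.0]
r = pearson(transcript, protein)
```

Only features present in both data sets are compared directly; features unique to one data set (here `peak_c` and `peak_d`) would need a separate presence/absence treatment.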
Another method of profiling a state of a biological system in a mammal includes
evaluating with statistical analysis a plurality of data sets for a biomolecular component type and
comparing features among the plurality of data sets to determine one or more sets of differences
among at least a portion of the plurality of data sets; evaluating with statistical analysis a
plurality of data sets for another biomolecular component type and comparing features among
the plurality of data sets to determine one or more sets of differences among at least a portion of
the plurality of data sets; and correlating the results of the above-described analyses to develop a
profile for a state of the biological system.

A further method of profiling a state of a biological system in a mammal includes
evaluating with statistical analysis a plurality of data sets comprising measurements from at least
two biomolecular component types and comparing features among the plurality of data sets to
determine one or more sets of differences among at least a portion of the plurality of data sets;
and developing a profile for a state of the biological system based on the results of the above-
described analysis.

Central to the methods and systems described is the analysis of a plurality of data
sets. The plurality of data sets [...] more than one biological
sample type, more than one type of measurement technique, more than one biomolecular
component type, or a combination of at least [...] sample type, a measurement
technique, and a biomolecular component type. The biological system preferably is in a
mammal, such as a human. A biomolecular component type includes a protein, a glycoprotein, a
gene, a gene transcript, and a metabolite.

A biological sample type includes, among others, blood, plasma, serum, cerebrospinal
fluid, bile, saliva, synovial fluid, pleural fluid, pericardial fluid, peritoneal fluid, sweat, feces,
nasal fluid, ocular fluid, intracellular fluid, intercellular fluid, lymph, urine, liver cells, epithelial
cells, endothelial cells, kidney cells, prostate cells, blood cells, lung cells, brain cells, skin cells,
adipose cells, tumor cells, and mammary cells. Data sets can include measurements from one
biological sample type that is treated differently or from one biological sample type that is
collected or analyzed at different times.

A measurement technique includes, among others, liquid chromatography, gas
chromatography, high performance liquid chromatography, capillary electrophoresis, mass
spectrometry, liquid chromatography-mass spectrometry, gas chromatography-mass
spectrometry, high performance liquid chromatography-mass spectrometry, capillary
electrophoresis-mass spectrometry, nuclear magnetic resonance spectrometry, parallel
hybridization assay, parallel sandwich assay, and competitive assay. Data sets can include
measurements from different instrument configurations of a single type of measurement
technique.

Subsequent to developing a profile for the state of a biological system, the profile can be
compared to a profile of another state of a biological system, where the biological systems are
the same or different. A profile also can be compared to a database of profiles to evaluate
whether the state of the biological system matches or is similar to a known state. The methods
described herein may be carried out by an article of manufacture having a computer-readable
medium with computer-readable instructions embodied thereon for performing the methods.
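
Comparing a profile to a database of profiles can be sketched as a nearest-match search. The following is a hedged illustration under stated assumptions: a profile is represented as a fixed-length list of feature values, cosine similarity is used as the similarity measure, and the 0.9 threshold, the function names, and the two database entries are all hypothetical. The application does not prescribe any of these choices.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of features")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("similarity is undefined for an all-zero profile")
    return dot / (norm_a * norm_b)


def best_match(profile, database, threshold=0.9):
    """Return (state, similarity) for the closest known profile.

    The state is None when even the closest profile falls below the
    (assumed) similarity threshold.
    """
    state, score = max(
        ((name, cosine_similarity(profile, known)) for name, known in database.items()),
        key=lambda item: item[1],
    )
    return (state if score >= threshold else None), score


# Hypothetical database of known states and a query profile.
database = {"healthy": [1.0, 0.0, 0.5], "disease": [0.0, 1.0, 0.5]}
query = [0.9, 0.1, 0.5]
state, score = best_match(query, database)
```

Returning `None` below the threshold distinguishes "similar to a known state" from "closest of several poor matches", which matters when the database does not cover the state being profiled.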

[...] figures, detailed description, and claims, all of which illustrate the principles of the invention by
way of example only.

Brief Description of the Figures 

The foregoing and other objects, features, and advantages of the invention described
above will be more fully understood from the following description of various illustrative
embodiments, when read together with the accompanying drawings. In the drawings, like
reference characters generally refer to the same parts throughout the different views. The
drawings are not necessarily to scale, and emphasis instead is generally placed upon illustrating
the principles of the invention.

Figure 1 is a schematic flow diagram illustrating the integration of genomic, proteomic,
metabolomic and clinical data sets to develop a profile of a biological system.

Figure 2 is a flow diagram of various analytical and processing steps as applied to a 
plurality of data sets according to an illustrative embodiment of the invention. 

Figure 3 illustrates the experimental design of the ApoE3-Leiden transgenic mouse gene
expression experiment.

Figure 4 illustrates a significance plot for the gene expression experiment.

Figure 5 illustrates a significance plot for the selected 1059 peptide peaks from four liver
fractions.

Figure 6 illustrates a block design for the synthetic data GIST experiment.

Figure 7 illustrates scatter plots and a normal probability plot for variety 1 of the
synthetic GIST data set.

Figure 8 illustrates scatter plots and a normal probability plot for variety 2 of the
synthetic GIST data set.

Figure 9 illustrates scatter plots and a normal probability plot for variety 3 of the
synthetic GIST data set.

Figure 10 illustrates a significance plot for the synthetic GIST data set.

Figure 11 illustrates a flow diagram that describes the treatment of the gene expression
data derived from a biological sample.

Figure 12 illustrates a flow diagram that describes the treatment of the protein data 
derived from a biological sample. 

Figure 13 illustrates a flow diagram that describes the treatment of the metabolite data
derived from a biological sample.

Figure 14 illustrates a flow diagram that describes the integration of a plurality of data
sets derived from two or more biomolecular component types.

Figure 15 illustrates a gene expression analysis that reveals mRNA abundance.

Figure 16 illustrates results for selected groups from a gene expression analysis.

Figure 17 illustrates results for selected groups from a gene expression analysis.

Figure 18 illustrates intensity plots of LC/MS total ion chromatograms of proteins from
plasma samples.

Figure 19 illustrates total ion chromatograms from LC/MS profiling of proteins from
plasma samples.

Figure 20 illustrates LC/MS chromatograms acquired from the digested liver proteins of
five transgenic and five wildtype mice.

Figure 21 illustrates ¹H NMR spectra of metabolites extracted from plasma from
transgenic and wildtype mice.

Figure 22 illustrates mass chromatograms of plasma lipids recorded using LC/MS for
transgenic and wildtype mice.

Figure 23 illustrates individual gene, protein, and metabolite spectra that are normalized 
and then concatenated to form a single factor spectrum for comparison across individual 
biomolecular component types. 

Figure 24 illustrates clustering of wildtype and transgenic mice data resulting from 
Principal Component and Discriminant ("PC-DA") statistical analysis. 

Figure 25 illustrates a difference factor spectrum of peptides exhibiting significant
differences (note m/z value 1366).

Figure 26 illustrates a mass spectrum and a sequence of a peptide (m/z value 1366) from
mouse plasma recorded using LC/MS/MS, where the peptide deduced from the MS/MS
spectrum is identified as residues 57-79 in the sequence of human apolipoprotein E3.

Figure 27 illustrates a correlation network between biomolecular component types. 

Figure 28 illustrates a map of known relations between the correlation network 
associations and published information. 

Figure 29 illustrates typical 
("Markers") or therapeutic agents that can be derived from a systems biology analysis. 

Figure 30A illustrates the experimental design of the ApoE3-Leiden transgenic mouse
experiment.

Figure 30B illustrates a scatter plot of the cDNA microarray data.

Figure 31A illustrates the LC/MS chromatograms for the digested liver protein fraction
for the ten samples.

Figure 31B illustrates the clustering analysis of the tryptic peptide profiles.

Figure 31C illustrates a factor spectrum of the liver protein data.

Figure 32A illustrates the clustering resulting from the principal component analysis of
the liver lipid data set.

Figure 32B illustrates a factor spectrum of the liver lipid data set.

Figures 33A, 33B, and 33C illustrate a comprehensive systems analysis based on data
from three biomolecular component types, where a relative abundance of 1.0 is 100% (Figure
33A - mRNA; Figure 33B - protein; Figure 33C - lipid).

Figure 34 is a schematic illustrating hyperlipidemia and atherosclerosis in a blood vessel.

Figure 35 illustrates a whole plasma parallel proteo-metabolic profiling scheme.

Figure 36 illustrates NMR spectra for a wildtype mouse plasma sample (WT) and a
transgenic mouse plasma sample (TG).

Figure 37 illustrates a PC-DA score plot showing clustering of NMR data for the
transgenic mouse, represented by triangles, and the wildtype (or control) mouse, represented by
circles.

Figure 38 illustrates a difference spectrum characterized by a number of lines 
representing various metabolic components. 

Figure 39 illustrates total ion chromatograms (TICs) for deproteinated lipid fractions
from transgenic (TG) mice and wildtype (WT) mice analyzed by a 4-step gradient in the LC
dimension with mass spectrum acquired over 200-1700 m/z mass range.

Figure 40 illustrates total ion chromatograms from transgenic (TG) mice and wildtype
(WT) mice protein fractions obtained from tryptic peptides.

Figure 41 illustrates a score plot showing PC-DA clusters for the wildtype (WT) and 
transgenic mouse (TG). 

Figure 42 illustrates difference factor spectra for protein and metabolite components.

Figure 43 illustrates a schematic representation of data analysis workflow.

Figure 44 illustrates the workflow for an unsupervised clustering analysis for multiple
platforms.

Figure 44A illustrates COSA unsupervised clustering of LC/MS proteomic data,
revealing four distinct clusters.

Figure 44B illustrates COSA unsupervised clustering of multiple data sets that have been
concatenated.

Figure 45 illustrates the workflow for selecting and comparing components of one
sample that are different from another sample.

Figure 45A illustrates a representative graph of selected protein, lipid, and metabolite
differences between rat groups identified using the univariate statistical method.

Figure 46 illustrates a correlation network for the comparison between drug-treated
diseased rodents and vehicle-treated diseased rodents (drug effect on disease).

Figure 47 illustrates an intensity plot visualization of correlations between pairs of
components in the drug-treated diseased rodents and vehicle-treated diseased rodents (drug effect
on disease).

Figure 48 illustrates a plot showing ratios between groups based on the means of the peak
intensity values within each group (after normalization and scaling) related to peptides from
certain proteins.

Figure 49 illustrates COSA distance clustering using human LC/MS lipid peaks.

Figure 50 illustrates the workflow for a comparison and correlation of human sample data
with non-human sample data.

Figure 50A illustrates the results of a COSA analysis of human serum samples in which
the input data set used for classification consisted of 366 lipid peaks chosen from the rodent
model of the human disease.

Figure 51 illustrates the success rate of an SVM linear classifier as a function of number
of lipid peaks.

Figure 52 illustrates a comparison of lipid abundance changes and correlations across
human and rodent species.

Figure 53 illustrates the workflow for analysis of several data sets.

Figure 54 illustrates a graphical representation of selecting analytes for a biomarker.

Figure 55 illustrates the performance of a fifteen analyte biomarker in grouping samples.

Figure 56 illustrates the list of analytes from Figure 55.
Detailed Description of the Invention 

The methods and systems disclosed herein rely on multiple measurements of biological
samples, including analysis of metabolites, proteins, genes and gene transcripts, to permit a
skilled artisan to understand a biological system in greater depth than an approach that examines
only one of these factors. Understanding the biological system as a whole can improve multiple
aspects of pharmaceutical discovery and development, including drug safety and efficacy, drug
response, and the etiology of disease. As described herein, a systems biology platform can
integrate genomics, proteomics and metabolomics, and bioinformatics, and results in a data
integration and knowledge management platform that generates connections, correlations, and
relationships among thousands of measurable molecular components to develop a profile of a
state of a biological system. Resulting profiles can be combined with clinical information to
increase the knowledge of a state of a biological system.

A "profile" of a biological system is a summary or analysis of data representing
distinctive features or characteristics of the biological system, e.g., of a mammal such as a
human. The data can include measurements or features derived from a biological sample type, a
type of measurement technique, and a biomolecular component type. The data often are spectral
or chromatographic features that are in the form of a graph, table, or some similar data
compilation. A profile typically is a set of data features that permit characterization of a state of
a biological system.

A profile can be considered to include one or more "biomarkers" of a biological system.
A biomarker generally refers to a biological component type, e.g., a gene, a gene transcript, a
protein or a metabolite, whose qualitative and/or quantitative presence or absence in a biological
system is an indicator of a biological state of a mammal. Thus, a profile can be considered to
be a set of distinctive biomarkers, e.g., spectral or chromatographic features, that permit
characterization of a state of a biological system. A profile also can be considered to include
correlations and other results of analyses of the data sets, e.g., causality. Thus, a profile can
comprise a plurality of different elements as described above, or can comprise only one of these
elements, e.g., biomarker(s).

WO 2005/020125    PCT/US2004/027022

A "state of a biological system" refers to a condition in which the biological system
exists, either naturally or after a perturbation. Examples of a state of a biological system include,
but are not limited to, a normal or healthy state, a disease state, a pharmacological agent
response, a toxicological state, a biochemical regulation (e.g., apoptosis), an age response, an
environmental response, and a stress response. The biological system preferably is in a mammal,
which includes humans and non-human mammals such as mice, rats, guinea pigs, dogs, cats,
monkeys, and the like.

A profile of a state of a biological system permits the comparison of one profile to
another profile to determine whether the profiles are in the same state, e.g., a healthy or a
diseased state. A biological system is better characterized using a multivariate analysis rather
than using multiple measurements of [...]
the biological system as a whole. Disparate data from multiple, different sources is treated as if
in a single dimension [...]. Consequently, [...] analysis of data is
more informative, [...] more robust and predictive than one that
is developed by systematically evaluating multiple components in [...]
a particular biomolecular component type.

A "biomolecular component type" refers to a class of biomolecules generally associated
with a level of a biological system. For example, genes and gene transcripts (which may be
interchangeably referred to herein) are examples of biomolecular component types that generally
are associated with gene expression in a biological system, and where the level of the biological
system is referred to as genomics or functional genomics. Proteins and their constituent peptides
(which may be interchangeably referred to herein) are another example of a biomolecular
component type that generally is associated with protein expression and modification, and where
the level of the biological system is referred to as proteomics. Glycoproteins also are considered
a biomolecular component type. Another example of a biomolecular component type is
metabolites (which also may be referred to as small molecules), which generally are associated
with a level of a biological system referred to as metabolomics. Metabolites include, but are not
limited to, lipids, steroids, amino acids, organic acids, bile acids, eicosanoids, neuropeptides,
vitamins, neurotransmitters, carbohydrates, ionic organics, nucleotides, inorganics, xenobiotics,
peptides, trace elements, and pharmacophore and drug breakdown products.
The methods described herein may be used to develop a profile of a state of a biological
system based on any single biomolecular component type as well as based on two or more
biomolecular component types. Profiles of biomolecular component types facilitate the
development of comprehensive profiles of different levels of a biological system, e.g., genome
profiles, transcriptomic profiles, proteome profiles and metabolome profiles, and permit their
integration and analysis. That is, the methods may be used to analyze measurements derived
from one or more biological sample types, one or more types of measurement technique, or a
combination of at least one each of a biological sample type and a measurement technique so as
to permit the evaluation of similarities, differences, and/or correlations in a single biomolecular
component type or across two or more biomolecular component types. From these
measurements, better insight into underlying biological mechanisms may be gained, novel
biomarkers/surrogate markers may be detected, and intervention routes may be developed.

A "biological sample type" includes, but is not limited to, blood, blood plasma, blood
serum, cerebrospinal fluid, bile acid, saliva, synovial fluid, pleural fluid, pericardial fluid,
peritoneal fluid, sweat, feces, nasal fluid, ocular fluid, intracellular fluid, intercellular fluid,
lymph, urine, tissue, liver cells, epithelial cells, endothelial cells, kidney cells, prostate cells,
blood cells, lung cells, brain cells, adipose cells, tumor cells, and mammary cells. The sources of
biological sample types may be different subjects; the same subject at different times; the same
subject in different states, e.g., prior to drug treatment and after drug treatment; different sexes;
different species, e.g., a human and a non-human mammal; and various other permutations.
Further, a biological sample type may be treated differently prior to evaluation such as using
different work-up protocols.

A "measurement technique" refers to any analytical technique that generates or provides
data that is useful in the analysis of a state of a biological system. For example, measurement
techniques include, but are not limited to, mass spectrometry ("MS"), nuclear magnetic
resonance spectroscopy ("NMR"), liquid chromatography ("LC"), gas chromatography ("GC"),
high performance liquid chromatography ("HPLC"), capillary electrophoresis ("CE"), gel
electrophoresis ("GE") and any known form of hyphenated mass spectrometry in low or high
resolution mode, such as LC/MS, GC/MS, CE/MS, MS/MS, MSⁿ, and other variants.
Measurement techniques include biological imaging such as magnetic resonance imagery
("MRI"), video signals, and an array of fluorescence, e.g., light intensity and/or color from
points in space, and other high throughput or highly parallel data collection techniques.
Measurement techniques also include optical spectroscopy, digital imagery,
oligonucleotide array hybridization, protein array hybridization, DNA hybridization arrays
("gene chips"), immunohistochemical analysis, polymerase chain reaction, nucleic acid
hybridization, electrocardiography, computed axial tomography, positron emission tomography,
and subjective analyses such as found in text-based clinical data reports. For a particular
analysis, different measurement techniques may include different configurations of
settings relating to the same measurement technique.
A "measurement" refers to an element of a data set that is generated by a measurement
technique. A "data set" includes measurements derived from one or more sources. For
example, a data set derived from a [...]
collected by the same technique, i.e., a collection or set of data of related measurements.
Further, data sets [...] diverse data, e.g., protein expression
data, gene expression data, metabolite concentration data, magnetic resonance imaging data,
electrocardiogram data, genotype data, single nucleotide polymorphism data, [...]
biological data. That is, [...] of a biological system being
studied may serve as the basis for generating a given data set.

A "feature" of a data set refers to a particular [...] with that data set
that may be compared to another data set. For example, a profile typically is a set of data
features that permit characterization of a state of a biological system.

Data sets may refer to substantially all or a sub-set of the data associated with one or
more measurement techniques. For example, the data associated with the spectrometric
measurements of different sample sources may be grouped into different data sets. As a result, a
first data set may refer to experimental group sample measurements and a second data set may
refer to control group sample measurements. In addition, data sets may refer to data grouped
based on any other classification considered relevant. For example, data associated with the
spectrometric measurements of a single sample source may be grouped into different data sets
based on the instrument used to perform the measurement, the time a sample was taken, the
appearance of a sample, or other identifiable variables and characteristics.
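
The grouping of measurements into data sets by any relevant classification can be sketched with a small record type. This is an illustrative assumption, not a structure defined in the application: the `Measurement` class, its field names, and the sample values are hypothetical and were chosen only to mirror the sources named above (sample type, measurement technique, biomolecular component type) plus an experimental/control label.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One measurement record; all field names are hypothetical."""
    sample_id: str
    group: str           # e.g., "experimental" or "control"
    sample_type: str     # e.g., "plasma"
    technique: str       # e.g., "LC/MS"
    component_type: str  # e.g., "protein"
    values: tuple        # spectral or chromatographic intensities


def group_into_data_sets(measurements, key):
    """Group measurements into data sets keyed by one attribute."""
    data_sets = defaultdict(list)
    for measurement in measurements:
        data_sets[getattr(measurement, key)].append(measurement)
    return dict(data_sets)


measurements = [
    Measurement("m1", "experimental", "plasma", "LC/MS", "protein", (1.0, 2.0)),
    Measurement("m2", "control", "plasma", "LC/MS", "protein", (1.1, 0.5)),
    Measurement("m3", "experimental", "plasma", "NMR", "metabolite", (0.3, 0.9)),
]

# The same measurements yield different data sets under different groupings.
by_group = group_into_data_sets(measurements, "group")
by_technique = group_into_data_sets(measurements, "technique")
```

Because the grouping key is a parameter, the same records can be regrouped by instrument, collection time, or any other recorded attribute without copying the underlying data.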

Accordingly, one data set may include a sub-set of another data set. For example, a
grouping based on appearance of the sample may include one or more experimental group data
sets. Where the measurement technique is NMR, a data set may include one or more NMR
spectra. Where the measurement technique is ultraviolet (UV) spectroscopy, a data set may
include one or more UV emission or absorption spectra. Similarly, where the measurement
technique is MS, a data set may include one or more mass spectra. Where the measurement
technique is a chromatographic-MS technique, like LC/MS or GC/MS, a data set may include
one or more mass chromatograms. Alternatively, a data set of a chromatographic-MS technique
may include one or more total ion current ("TIC") chromatograms or reconstructed TIC
chromatograms. In addition, it should be realized that [...]
spectrometric data and data that has been preprocessed, e.g., to remove noise, to correct a
baseline, to smooth the data, to detect peaks, and/or to normalize the data.
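
The preprocessing steps listed above (baseline correction, smoothing, peak detection, normalization) can be sketched on a one-dimensional signal. The routines below are deliberately simple stand-ins chosen for illustration: a constant-offset baseline, a moving average, thresholded local maxima, and total-intensity scaling. They are assumptions of this sketch and are not the specific preprocessing methods used by the applicants; the function names and the example signal are hypothetical.

```python
def correct_baseline(signal):
    """Subtract the minimum intensity as a crude constant baseline."""
    base = min(signal)
    return [value - base for value in signal]


def smooth(signal, window=3):
    """Centered moving average; the window shrinks at the two edges."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def detect_peaks(signal, threshold):
    """Indices of strict local maxima whose intensity exceeds the threshold."""
    return [
        i
        for i in range(1, len(signal) - 1)
        if signal[i] > threshold and signal[i - 1] < signal[i] > signal[i + 1]
    ]


def normalize(signal):
    """Scale intensities so that they sum to 1."""
    total = sum(signal)
    if total == 0:
        raise ValueError("cannot normalize an all-zero signal")
    return [value / total for value in signal]


# Hypothetical raw trace with two peaks on a constant offset of 2.0.
raw = [2.0, 2.0, 3.0, 8.0, 3.0, 2.0, 3.0, 6.0, 3.0, 2.0, 2.0]
corrected = correct_baseline(raw)
smoothed = smooth(corrected)
peaks = detect_peaks(smoothed, threshold=1.0)
normalized = normalize(smoothed)
```

The threshold plays the role of separating actual species from baseline noise; its value here is arbitrary and would in practice depend on the instrument and the noise level.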

"Spectrometric data" refers to any data that may be represented in the form of a graph,
table, vector, array or some similar data compilation, and may include data from any
spectrometric or chromatographic technique. The term "spectrometric measurement" includes
measurements made by any spectrometric or chromatographic technique.

Central to the methods disclosed herein is the statistical analysis of a plurality of data
sets. "Statistical analysis" includes parametric analysis, non-parametric analysis, univariate
analysis, multivariate analysis, linear analysis, non-linear analysis, and other statistical methods
known to those skilled in the art. Multivariate analysis, which determines patterns in apparently
chaotic data, includes, but is not limited to, principal component analysis ("PCA"), discriminant
analysis ("DA"), PCA-DA, canonical correlation ("CC"), cluster analysis, partial least squares
("PLS"), predictive linear discriminant analysis ("PLDA"), neural networks, and pattern
recognition techniques.

Of course, before performing multivariate analysis, the raw data may be preprocessed to
assist in the comparison of different data sets. In particular, to compare data across different
biomolecular component types, appropriate preprocessing should be performed. Preprocessing of
the data may include (i) aligning data points between data sets, e.g., using partial linear fit
techniques to align peaks of spectra of different samples; (ii) normalizing the data of the data
sets, e.g., using standards in each measurement to adjust peak height; (iii) reducing the noise
and/or detecting peaks, e.g., setting a threshold level for peaks so as to discern the actual
presence of a species from potential baseline noise; and/or (iv) other data processing techniques
known in the art. Data preprocessing can include entropy-based peak detection as disclosed in
U.S. Patent No. 6,743,364, and partial linear fit techniques (such as found in J.T.W.E. Vogels et
al., "Partial Linear Fit: A New NMR Spectroscopy Processing Tool for Pattern Recognition
Applications," Journal of Chemometrics, vol. 10, pp. 425-38 (1996)).
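
As a minimal illustration of preprocessing items (ii) and (iii) above, the following Python sketch
scales a spectrum to the height of an internal-standard peak and then keeps only local maxima above
a threshold. It is not the entropy-based or partial linear fit method cited above; the function
names, the choice of the standard's index, and the threshold value are assumptions made for
illustration only.

```python
def normalize_to_standard(intensities, standard_index):
    """Scale a spectrum so that the internal-standard peak has height 1.0."""
    ref = intensities[standard_index]
    if ref <= 0:
        raise ValueError("internal standard peak must be positive")
    return [value / ref for value in intensities]


def detect_peaks(intensities, threshold):
    """Return indices of local maxima whose height exceeds the threshold."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        value = intensities[i]
        is_local_max = value > intensities[i - 1] and value >= intensities[i + 1]
        if value > threshold and is_local_max:
            peaks.append(i)
    return peaks


# Hypothetical spectrum: index 2 holds the internal standard.
spectrum = [0.1, 0.2, 4.0, 0.3, 0.1, 2.0, 0.2, 0.15, 1.0, 0.1]
normalized = normalize_to_standard(spectrum, standard_index=2)
peaks = detect_peaks(normalized, threshold=0.2)
```

Raising the threshold trades sensitivity for robustness against baseline noise; the value used here
is arbitrary.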

Throughout the description, where compositions are described as having, including, or
comprising specific components, or where processes are described as having, including, or
comprising specific process steps, it is contemplated that compositions of the present invention
also consist essentially of, or consist of, the recited components, and that the processes of the
present invention also consist essentially of, or consist of, the recited processing steps.

It should be understood that the order of steps or order for performing certain actions is
immaterial so long as the invention remains operable, i.e., a profile of a biological system is
developed. Moreover, two or more steps or actions may be conducted simultaneously.

The methods described herein generally include evaluating with statistical analysis a
plurality of data sets of a biological system, and comparing features among the data sets to
determine one or more sets of differences among at least a portion of the data sets so as to
develop a profile for a state of a biological system based on the comparison. In some
embodiments, the data sets are derived from one or more biological sample types and include
measurements derived from one or more measurement techniques. In other embodiments, the
data sets are derived from two or more biological sample types and include one or more different
types of spectrometric measurements of a sample of the biological system.

In certain embodiments, the data sets are evaluated using multivariate
analysis. In other embodiments, more than one statistical analysis is performed on the plurality
of data sets, on various permutations of the plurality of data sets, and/or on the results of a
particular statistical analysis. For example, a profile may be developed by separately evaluating
a plurality of data sets including measurements derived from proteins in the biological system
and a plurality of data sets including measurements derived from metabolites in the biological
system, then evaluating with statistical analysis the results of the individual analyses to develop a
profile for the biological system that includes both proteins and metabolites. Alternatively, the
plurality of data sets relating to proteins and metabolites of the biological systems may be
simultaneously evaluated with statistical analysis.

Analogously, a profile can be developed from data sets including measurements derived
from a protein and a gene; a protein and a gene transcript; a gene and a gene transcript; a gene
and a metabolite; and a gene transcript and a metabolite. A profile also can be developed from
data sets including measurements derived from a protein, a gene, and a gene transcript; a protein,
a gene and a metabolite; a protein, a gene transcript and a metabolite; a gene, a gene
transcript and a metabolite; and a protein, a gene, a gene transcript and a metabolite. In addition,
each of the above permutations can include, in addition or as a substitution, a glycoprotein.

Measurements for a particular biomolecular component type usually are generated by a
measurement technique or techniques that are often used and known in the art for that particular
biomolecular component type. For example, an analysis of metabolites may use NMR, e.g., ¹H
NMR; LC/MS; GC/MS; and MS/MS. Analysis of other biomolecular component types may use
LC/MS; GC/MS; and MS/MS.

In one embodiment, the method generally includes selecting a biological sample;
preparing the biological sample based on the biochemical components to be investigated and the
spectrometric techniques to be employed; measuring the components in the biological samples
using spectrometric and chromatographic techniques; measuring selected molecule subclasses
using NMR and MS approaches to study compounds; preprocessing the raw data; using
statistical analysis to analyze the preprocessed
data to identify patterns in measurements of single subclasses of molecules or in measurements
of components using NMR or MS; and using statistical analysis to combine data sets from
distinct experiments and identify patterns of interest in the data.

The technology platform may also include normalizing a plurality of data sets to facilitate
comparison of the data across biomolecular component types. The invention also provides
techniques for determining associations/correlations between biomolecular component types of
suitable data sets using linear, non-linear or other mathematical tools. Moreover, using these
associations and/or correlations to postulate networks of interacting biomolecular components, to
determine causality among these associations, and to establish hypotheses about the biological
processes underlying the observations which give rise to the data sets, is still another aspect of
the methods and systems described herein.

The application also provides an article of manufacture where the functionality of a
method disclosed herein is embedded on a computer-readable medium such as, but not limited
to, a floppy disk, a hard disk, an optical disk, a magnetic tape, a PROM, an EPROM, CD-ROM,
or DVD-ROM. The functionality of the method may be embedded on the computer-readable
medium in any number of computer-readable instructions or languages such as FORTRAN,
PASCAL, C, C++, BASIC and assembly language. Further, the computer-readable instructions
may be written in a script, macro, or functionally embedded in commercially available software
such as EXCEL or VISUAL BASIC. In other aspects, the application provides systems adapted
to practice the methods described herein.

The data processing device may include an analog and/or digital circuit adapted to
implement the functionality of one or more of the methods disclosed herein using at least in part
information provided by the spectrometric instrument. In some embodiments, the data
processing device may implement the functionality of the methods described herein as software
on a general-purpose computer. In addition, such a program may set aside portions of a
computer's random access memory to provide control logic that affects the spectrometric
measurement acquisition, statistical analysis of data sets, and/or profile development for a
biological system. In such an embodiment, the program may be written in any one of a number
of high-level languages, such as FORTRAN, PASCAL, C, C++, or BASIC. Further, the
program can be written in a script, macro, or functionality embedded in proprietary software or
commercially available software, such as EXCEL or VISUAL BASIC. Additionally, the
software could be implemented in an assembly language directed to a microprocessor resident on
a computer. For example, the software can be implemented in Intel 80x86 assembly language if
it is configured to run on an IBM PC or PC clone. The software may be embedded on an article
of manufacture including, but not limited to, a computer-readable program medium such as a
floppy disk, a hard disk, an optical disk, a magnetic tape, a PROM, an EPROM, or CD-ROM.

As shown in Figure 1, in some embodiments, the method begins with parallel analyses of
[...] samples [...]. The mean quantities, as well as the [...], such as pattern recognition to
identify molecules to link gene [...] activity, and metabolite dynamics. The methods disclosed
herein, coined BioSystematics™, then can be employed to translate covariant sets of genes
including gene transcripts, proteins, and metabolites, optionally with [...], into an understanding
of their biochemical interaction to elucidate a profile of a biological system and/or target
information. This information, the extent to which particular groups of molecules co-vary, and
existing pathway knowledge then are used to assemble molecular networks and place compounds in
their biological context so as to develop a profile of a state of the biological system.

Figure 2 shows a flow chart of one embodiment of an analytical method 200. It should
be understood that one or more of the steps described below can be omitted and/or the order of
steps can be changed so long as the embodiment remains operable, i.e., capable of developing a
profile of a state of a biological system. One or more data sets 205 taken from two or more
biomolecular component types are subjected to an initial preprocessing step 210 prior to further
data analysis. In a preferred embodiment, the initial processing step typically includes
concatenating one or more of the plurality of data sets. This initial preprocessing step may also
include integrating together the data sets based on a suitable schema or data hierarchy. In some
embodiments, the initial processing step includes both a concatenation step and an integration
step. The initial processing optionally may include, follow, or precede various forms of
preprocessing including, but not limited to, data smoothing, noise reduction, baseline correction,
and peak detection.
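
The concatenation described above can be pictured with a short Python sketch. It assumes, purely
for illustration, that each data set is a mapping from a sample identifier to a list of measured
values, and that only samples present in every data set are retained. The function name, the
platform labels, and the sample identifiers are hypothetical.

```python
def concatenate_data_sets(data_sets):
    """Join per-sample feature vectors from several data sets into one row per sample.

    data_sets maps a data set name to {sample id: list of values}.
    Only samples present in every data set are kept.
    """
    names = sorted(data_sets)
    shared = set.intersection(*(set(data_sets[name]) for name in names))

    # Column labels record which data set each concatenated value came from.
    labels = []
    for name in names:
        width = len(next(iter(data_sets[name].values())))
        labels.extend(f"{name}:{j}" for j in range(width))

    rows = {}
    for sample in sorted(shared):
        row = []
        for name in names:
            row.extend(data_sets[name][sample])
        rows[sample] = row
    return labels, rows


data_sets = {
    "metabolite_nmr": {"mouse1": [0.2, 0.4], "mouse2": [0.3, 0.5]},
    "protein_lcms": {
        "mouse1": [10.0, 12.0, 9.0],
        "mouse2": [11.0, 13.0, 8.0],
        "mouse3": [10.5, 12.5, 9.5],
    },
}
labels, rows = concatenate_data_sets(data_sets)
```

Because the concatenated columns carry different units, a normalization step would normally be
applied before any joint statistical analysis.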

The data sets that are the subject of the initial preprocessing step may include any
measurable or quantifiable aspect of the biological system being studied. For example, the data
sets may represent collections of, e.g., protein expression data, gene expression data, metabolite
concentration data, magnetic resonance imaging data, electrocardiogram data, genotype data,
and/or single nucleotide polymorphism data. Statistical methods such as principal component
analysis may be utilized to convert the data sets to factor spectra, which are simply a processed
form of the raw data.

A means for comparing data sets of completely unrelated phenomena with disparate units
of measure is necessary, especially given the broad range of data sets that may be employed.
Referring to Figure 2, for such disparate data sets, a normalization step 215, which is described
in more detail below, may be implemented. Generally, individual data sets are normalized by
scaling the data set with optimal scaling parameters calculated using a maximum likelihood
estimator. Normalization facilitates comparison of data sets taken from one or more
biomolecular component types.

An extraction step 220 is typically performed on the processed data. In the extraction
step, one or more list(s) of components, which exhibit statistically significant changes, are
extracted. The components typically are biological component types, or more specifically
biomolecular component types. Further, these changes also are quantified as part of the
extraction step. The extraction step typically involves a statistical analysis to discern the
differences and/or similarities between the data sets. The extraction step and associated
quantification of differences facilitate discerning similarities, differences, and/or correlations
between or among two or more biomolecular component types for the biological sample under
investigation.

Suitable forms of statistical analysis appropriate for quantifying the change between
component types include, e.g., principal component analysis ("PCA"), discriminant analysis
("DA"), PCA-DA, canonical correlation ("CC"), partial least squares ("PLS"), predictive linear
discriminant analysis ("PLDA"), neural networks, and pattern recognition techniques. In one
embodiment, PCA-DA is performed at a first level of correlation that produces a score plot, i.e.,
a plot of the data in terms of two principal components. Subsequently, the same or a different
statistical analysis is performed on the data sets based on the differences and/or similarities
discerned from previous analysis.

For example, in one embodiment, where a processed data set includes a PCA-DA score
plot, the next level of statistical processing may be a loading plot produced by a PCA-DA
analysis. This second level of correlation bears a hierarchical relationship to the first level in that
loading plots provide information on the contributions of individual input vectors to the PCA-DA
that in turn are used to produce a score plot. For example, where each data set includes a
plurality of mass chromatograms, a point on a score plot represents mass chromatograms
originating from one sample source. In comparison, a point on a loading plot represents the
contribution of a particular mass or range of masses to the correlations between data sets.
Similarly, where each data set includes a plurality of NMR spectra, a point on a score plot
represents one NMR spectrum. In comparison, a point on the corresponding loading plot
represents the contribution of a particular NMR chemical shift value or range of values to the
correlations between data sets.

Figure 2 also depicts a correlation network production step 225, which follows the
extraction step 220. The formulation of the correlation networks indicates potential associations
among the extracted list of components developed previously by the preceding step. A
correlation network is a representation (graphical or otherwise) of the
biomolecular component types of a system that vary in abundance between one or more groups
of samples. Two components are "correlated" if they vary in a somewhat synchronous manner.
For example, if both a gene and a protein are upregulated in group 1 as compared to group 2 and
the upregulation is consistent across all the biological samples including group 1, then the gene
and protein are considered to be "correlated." Analogously, biomolecular component types also
may be anti-correlated. Moreover, different "strengths of correlation" exist, which depend on
how tightly synchronous the relationship is between or among the two or more biomolecular
types.
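
One simple way to picture a correlation network is sketched below in Python. The sketch uses the
Pearson correlation coefficient as the "strength of correlation" and keeps an edge whenever its
absolute value reaches a threshold; both choices, along with the component names and abundance
values, are assumptions for illustration and are not prescribed by the description above. Positive
edge weights correspond to correlated components and negative weights to anti-correlated ones.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_network(profiles, threshold=0.8):
    """Return {(component_a, component_b): r} for pairs with |r| >= threshold."""
    edges = {}
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


# Hypothetical abundances: samples 1-3 belong to group 2, samples 4-6 to group 1.
profiles = {
    "gene_X": [1.0, 1.2, 0.9, 2.1, 2.3, 2.0],
    "protein_X": [0.5, 0.6, 0.4, 1.1, 1.2, 1.0],
    "metabolite_Y": [3.0, 2.9, 3.1, 1.5, 1.4, 1.6],
    "metabolite_Z": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
}
network = correlation_network(profiles)
```

In this toy data the gene and protein rise together in group 1 (a positive edge), the metabolite
"metabolite_Y" falls as they rise (negative edges), and "metabolite_Z" is left unconnected.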

A comparison step 230 is performed after the correlation networks have been established.
The correlation network associations, which encompass both correlations and anti-correlations,
are compared and evaluated based on existing knowledge of the component or biological system
under investigation. This knowledge relates to the associations which may be ascertained from
established sources such as research literature and/or experimental studies.

Subsequently, a perturbation step 235 typically is performed as part of the larger analysis.
The biological system subject to investigation is typically perturbed by changing an experimental
parameter and monitoring the system for a prescribed amount of time. Examples of
perturbations include, but are not limited to, introducing a drug, altering a gene, changing an
environmental condition, or making another suitable change. A perturbation also encompasses
the idea of comparing across species, i.e., performing the workflow on an animal system and
performing substantially the same workflow on a human system to investigate the similarities
and/or differences between or among species.

Following the perturbation step 235, new data sets and correlation networks are produced
240. Thus, as a result of the perturbations introduced into a given biological system or sample,
new data sets arise that are measurable. Similarly, as part of step 240, new correlation networks
may be developed based on those novel post-perturbation data sets. The statistically significant
changes in the new data sets, as determined in comparison to the pre-perturbation data sets, are
discerned by comparing the statistically significant biological component types in the new data
sets with the component types of the previous experimental results 245. In addition to looking at
the statistical changes between biomolecular component types before and after system
perturbation 245, correlation networks may be analyzed in kind. Therefore, the correlation
network associations may be compared before and after perturbation 250. After these
two levels of comparison 245, 250 have been performed, alterations or changes between
components and associations can be identified 255.
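
The comparison of associations before and after a perturbation can be sketched as simple set
operations in Python. The representation of a network as a mapping from a pair of component names
to a signed correlation value, and all names and values below, are assumptions for illustration.

```python
def compare_networks(before, after):
    """Compare two networks given as {(component_a, component_b): correlation}."""
    before_edges, after_edges = set(before), set(after)
    kept = before_edges & after_edges
    return {
        "lost": before_edges - after_edges,      # associations that disappeared
        "gained": after_edges - before_edges,    # associations that appeared
        # associations that switched between correlation and anti-correlation
        "sign_flipped": {edge for edge in kept if before[edge] * after[edge] < 0},
    }


# Hypothetical pre- and post-perturbation networks.
pre = {("gene_X", "protein_X"): 0.95, ("gene_X", "metabolite_Y"): -0.90}
post = {("gene_X", "protein_X"): -0.85, ("protein_X", "metabolite_Z"): 0.88}
changes = compare_networks(pre, post)
```

Repeating this comparison after each new perturbation gives one concrete form of the iteration
over steps 240 through 255.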

Thereafter, perturbations to the system being investigated can be iterated 260. A
feedback loop results among the initial perturbations to the system, the system itself, the
production of new data sets, the comparison of significant components with the previous
experiment, the comparison of new correlation network associations with previous associations,
and the identification of changes. The feedback loop may be iterated until causal relations can
be identified 265 between multiple biomolecular component types and the correlation
networks which characterize their impact on the biological system.

Referring back to the normalization step 215 in Figure 2 and introduced above, a method
for normalizing gene expression data, protein data, and metabolite level data is now described.
A sample variety effect, an array effect, and a dye effect are introduced into a log-linear model,
and a maximum likelihood estimation technique is applied to calculate all the parameters of
the model and determine the optimal scaling factor for each array and dye. The normalization
method is generic and can be applied to a variety of data, experimental setups, and designs. The
model described below uses terminology from gene expression analysis. For example, the
"array" in a proteomics experiment could be one mass spectrometer run, and the "dye" could
describe all samples used during the single run. Nevertheless, other biomolecular component
types could be analyzed using the model described below.

Normalization model. The data matrix $x$ is characterized by the gene index
$g$ ($g = 1 \ldots N_g$), array index $i$ ($i = 1 \ldots N_i$), dye index $k$ ($k = 1 \ldots N_k$),
and the variety index $v$ ($v = 1 \ldots N_v$). For each variety $v$, there are $n_v$ samples
corresponding to it, so

$$\sum_{v=1}^{N_v} n_v = N_i N_k. \tag{1}$$

Since variety assignment is a function of array and dye indices, each
data point is uniquely described by indices $g$, $i$, and $k$. For convenience the matrix is
transformed logarithmically:

$$y_{gik} = \log x_{gik}.$$

Data are modeled as:

$$y_{gik} = \mu_{gv} + A_i + D_k + \varepsilon_{gik} \tag{2}$$

where the gene and variety effects are described by $\mu_{gv}$, the array effect by $A_i$, the dye
effect by $D_k$, and the error function by $\varepsilon_{gik}$. The error function is assumed to be
normally distributed with zero mean and variance $\sigma_{gv}^2$, i.e., the variance is permitted
to be different for each gene and variety. The variety index $v$ is a unique function of $i$ and
$k$, and can be written as $\{i, k\} \in v$. Since the gene and variety, array, and dye effects are
assumed to be fixed, the distribution of expression levels can be described as:

$$P(y_{gik}) = \frac{1}{\sqrt{2\pi\sigma_{gv}^2}} \exp\left[-\frac{(y_{gik} - \mu_{gv} - A_i - D_k)^2}{2\sigma_{gv}^2}\right]. \tag{3}$$
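A maximum likelihood estimation is used to calculate the optimal scaling parameters used to
properly normalize the data. Solving for the parameters $\mu_{gv}$, $A_i$, $D_k$, and $\sigma_{gv}$
leads to the following equations:

$$\mu_{gv} = \frac{1}{n_v} \sum_{\{i,k\} \in v} (y_{gik} - A_i - D_k), \qquad
\sigma_{gv}^2 = \frac{1}{n_v} \sum_{\{i,k\} \in v} (y_{gik} - \mu_{gv} - A_i - D_k)^2,$$

$$A_i = \frac{\sum_{g,k} (y_{gik} - \mu_{gv} - D_k)/\sigma_{gv}^2}{\sum_{g,k} 1/\sigma_{gv}^2}, \qquad
D_k = \frac{\sum_{g,i} (y_{gik} - \mu_{gv} - A_i)/\sigma_{gv}^2}{\sum_{g,i} 1/\sigma_{gv}^2}. \tag{4}$$

The optimal scaling factors for each array and dye are then:

$$s_{ik} = -A_i - D_k, \tag{5}$$

so the normalized expression levels are:

$$\hat{y}_{gik} = y_{gik} + s_{ik} = y_{gik} - A_i - D_k. \tag{6}$$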
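
The log-linear model above can be fitted numerically in several ways. The Python sketch below is a
deliberately simplified version: it assumes a single common error variance, in which case the
maximum likelihood estimate reduces to ordinary least squares, and it alternates between updating
the gene-and-variety, array, and dye effects. The general model in the text allows a separate
variance for each gene and variety, which this sketch does not handle. The function names, the
zero-mean constraints on the array and dye effects, and the small dye-swap design are all
assumptions made for illustration.

```python
def fit_log_linear(y, variety, n_varieties, n_iter=50):
    """Least-squares fit of y[g][i][k] = mu[g][v] + A[i] + D[k], with v = variety[i][k].

    Assumes equal error variances and that every variety occurs at least once.
    A and D are centered to zero mean so that the parameters are identifiable.
    """
    n_g, n_i, n_k = len(y), len(y[0]), len(y[0][0])
    A = [0.0] * n_i
    D = [0.0] * n_k
    mu = [[0.0] * n_varieties for _ in range(n_g)]
    cells = [(i, k) for i in range(n_i) for k in range(n_k)]

    for _ in range(n_iter):
        # Gene-and-variety effect: mean residual over the samples of each variety.
        for g in range(n_g):
            sums = [0.0] * n_varieties
            counts = [0] * n_varieties
            for i, k in cells:
                v = variety[i][k]
                sums[v] += y[g][i][k] - A[i] - D[k]
                counts[v] += 1
            mu[g] = [s / c for s, c in zip(sums, counts)]
        # Array effect: mean residual over genes and dyes, then centered.
        for i in range(n_i):
            A[i] = sum(y[g][i][k] - mu[g][variety[i][k]] - D[k]
                       for g in range(n_g) for k in range(n_k)) / (n_g * n_k)
        mean_A = sum(A) / n_i
        A = [a - mean_A for a in A]
        # Dye effect: mean residual over genes and arrays, then centered.
        for k in range(n_k):
            D[k] = sum(y[g][i][k] - mu[g][variety[i][k]] - A[i]
                       for g in range(n_g) for i in range(n_i)) / (n_g * n_i)
        mean_D = sum(D) / n_k
        D = [d - mean_D for d in D]
    return mu, A, D


def normalize(y, A, D):
    """Remove the fitted array and dye effects from the log-transformed data."""
    return [[[y[g][i][k] - A[i] - D[k] for k in range(len(D))]
             for i in range(len(A))] for g in range(len(y))]


# Hypothetical noise-free dye-swap design: 3 genes, 4 arrays, 2 dyes, 2 varieties.
variety = [[0, 1], [1, 0], [0, 1], [1, 0]]
true_mu = [[1.0, 2.0], [3.0, 3.0], [0.5, 1.5]]
true_A = [0.3, -0.1, -0.4, 0.2]
true_D = [0.25, -0.25]
y = [[[true_mu[g][variety[i][k]] + true_A[i] + true_D[k] for k in range(2)]
      for i in range(4)] for g in range(3)]

mu, A, D = fit_log_linear(y, variety, n_varieties=2)
y_norm = normalize(y, A, D)
```

On this noise-free example the fitted array and dye effects match the values used to build the
data, and the normalized levels equal the gene-and-variety effects.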

Significance tests and bootstrap methods. The normalized data may be compared to a
null model, and a p-value may be calculated that measures the probability that the deviation of
the data from the null model can be attributed to the random error. The parameter used for
comparison is the fold ratio between the two chosen varieties. To evaluate the method, a t-test is
performed to compare the two chosen varieties. [Sheskin, Handbook of Parametric and
Nonparametric Statistical Procedures, Chapman & Hall/CRC, Boca Raton, FL (2000).] The
corresponding p-values were calculated for each gene. When assessing the statistical significance
of fold change for each gene, one needs to take into consideration the total $N_g$ p-values
calculated, as several p-values with $p < 1/N_g$ are expected. To account for this, the overall
likelihood, $P(p)$, of observing a p-value $< p$ for any of the $N_g$ genes is used. Assuming
independence of all genes, the overall likelihood is estimated with:

$$P(p) \approx 1 - (1 - p)^{N_g}. \tag{7}$$
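
As a worked illustration of the overall likelihood under the independence assumption, the Python
sketch below evaluates the expression and inverts it to find the per-gene p-value that corresponds
to a chosen overall likelihood. The function names are hypothetical, and the use of `log1p` and
`expm1` is only a numerical convenience for very small p-values.

```python
import math


def overall_likelihood(p, n_genes):
    """P(p) = 1 - (1 - p)**n_genes, evaluated stably for small p."""
    return -math.expm1(n_genes * math.log1p(-p))


def per_gene_cutoff(overall, n_genes):
    """Per-gene p-value at which the overall likelihood equals `overall`."""
    return -math.expm1(math.log1p(-overall) / n_genes)


n_genes = 9596  # number of genes analyzed in Example 1
cutoff = per_gene_cutoff(0.05, n_genes)
```

For 9,596 genes, an overall likelihood of 0.05 corresponds to a per-gene p-value of roughly
5.3e-6, which is why far fewer genes pass the overall criterion than the plain p < 0.05 criterion.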
Assuming independence of genes is obviously an oversimplification, and the correct way
to calculate p-values and $P(p)$ values is by using the bootstrap method with the parameters
($\mu_g$, $A_i$, $D_k$, $\sigma_g$) of the null model being used to generate random data sets.

Example 1. Normalization of Gene Expression Data from the Liver of an APOE*3-Leiden
Transgenic Mouse

To illustrate the normalization method, a study of the ApoE3-Leiden transgenic mouse 
was performed. A total of 9,596 genes were analyzed using ten cDNA microarrays. Samples 
were collected from a total of four ApoE3-Leiden transgenic (TG) mice and four wild type (WT) 
mice. An optimized design of the experiment was used. The variety vector was
therefore

112211221122112222 l]. (8) 

A t-test was applied, comparing the normalized values of transgenic and wild type mice.
Figure 4 shows the significance plot of the data based on p-values from the t-test and fold ratios.
The horizontal line on top shows the overall likelihood P(p) = 0.05 cutoff, while the lower line
shows the cutoff p = 0.05. Only 16 genes satisfy the more stringent former criterion, while there
are 713 genes in the p < 0.05 range.

Protein data from liver. Eight samples from eight different animals, four transgenic and
four wildtype, were analyzed in eight experiments. The variety vector is therefore:

$$v = [1\ 1\ 1\ 1\ 2\ 2\ 2\ 2]. \tag{9}$$

Mass spectrometry (MS) spectra were selected from a total of four fractions, each
containing 1600 peaks. The MS spectra were processed using the IMPRESS algorithm, which
was developed at the University of Leiden and is described in U.S. Patent No. 6,743,364.
IMPRESS peak characterization software uses an information theoretic measure (IQ) to
determine peak significance (between 0 and 1). A peak in the data set with IQ > 0.5 was retained
for analysis. A total of 1059 peaks were retained: 5 from fraction 1, 271 in fraction 3, 454 in
fraction 4, and 329 in fraction [...]. The significance plot is shown in Figure [...]. There are no
peaks satisfying the P(p) = 0.05 cutoff criterion, while there are 84 peaks with p < 0.05. In this
case, more data are necessary to determine if normalization should be performed.

Synthetic GIST data. To perform a test of the normalization method on data with a
higher number of dyes, a simulation with 2000 peaks, 5 dyes, 3 varieties, and
6 experiments was performed. This could potentially correspond to an experiment
performed using the Global Internal Standards Technology ("GIST") [Chakraborty, A. and
Regnier, F., J. Chromatog. A 949, 173-84 (2002)]. The experiment design is shown in Figure 6
and can also be described by the following variety vector:

$$v = [1\ 1\ 2\ 2\ 3\ \ 2\ 2\ 1\ 1\ 3\ \ 3\ 1\ 1\ 2\ 2\ \ 3\ 2\ 2\ 1\ 1\ \ 1\ 1\ 3\ 2\ 2\ \ 2\ 2\ 3\ 1\ 1]. \tag{10}$$
The background for each peak has been selected using a Gaussian random number
generator, set to equal mean and variance. Three large peaks have then been added for each of
the varieties 1 and 2, respectively, while variety 3 has been kept as control. Figures 7-9 show the
scatterplots and normal probability plots for each of the varieties. The three outliers are clearly
seen for varieties 1 and 2. The fold ratio:

$$\text{fold ratio} = \frac{\langle \text{Variety 2} \rangle}{\langle \text{Variety 1} \rangle} \tag{11}$$

was calculated for each peak, and a t-test was used to compare the two varieties. The
significance plot is shown in Figure 10. As expected, only six outliers satisfy the
P(p) = 0.05 cutoff criterion, while there are a total of 94 peaks satisfying p < 0.05, despite the
fact that each peak (except the six outliers) has been generated randomly for each sample
independently.

Illustrative examples of the work flow in Figure 2. Three additional examples are
disclosed herein to further illustrate the experimental methods, techniques, and analytic
approaches outlined in the flow diagram illustrated in Figure 2. More detailed flow diagrams are
presented in Figures 11, 12, and 13, which describe preparing a data set from a biological sample
and then extracting a list of either genes, proteins, or metabolites that exhibit a change in
abundance above the threshold value. Figures 11, 12, and 13 can be understood as a higher
resolution picture of Figure 2, focusing in particular on Steps 205 through 220 in Figure 2.
Figure 14 illustrates integrating the extracted list of components to produce correlation networks
that can be used to compare the network associations with associations known in the literature
(Steps 220, 225 and 230 in Figure 2). To provide an even finer resolution picture of the
illustrated embodiments, individual Figures 15-29 are presented, which map directly onto
individual steps shown in Figures 2, 11, 12, 13 and 14.

Example 2. Systems Biology Analysis of the APOE*3-Leiden Transgenic Mouse 

As a test case for the application of systems biology analysis to a mammalian system, the
apolipoprotein E3-Leiden (APOE*3-Leiden, APOE*3) transgenic mouse was selected. Apo E is
a component of very low density lipoproteins (VLDL) and VLDL remnants and is required for
receptor-mediated re-uptake of lipoproteins by the liver. [Glass and Witztum, Cell 104, 502
(2001).] The APOE*3-Leiden mutation is characterized by a tandem duplication of codons 120-126
and is associated with familial dysbetalipoproteinemia in humans. [van den Maagdenberg et
al., Biochem. Biophys. Res. Commun. 165, 851 (1985); and Havekes et al., Hum. Genet. 73, 157
(1986).] Transgenic mice overexpressing human APOE*3-Leiden are highly susceptible to
diet-induced hyperlipoproteinemia and atherosclerosis due to diminished hepatic LDL receptor
recognition, but when fed a normal chow diet they display only mild type I (macrophage foam
cells) and II (fatty streaks with intracellular lipid accumulation) lesions at 9 months. [Jong et al.,
Arterioscler. Thromb. Vasc. Biol. 16, 934 (1996).]

APOE*3-Leiden transgenic mouse strains were generated by microinjecting a twenty-seven
kilobase genomic DNA construct containing the human APOE*3-Leiden gene, the
APOC1 gene, and a regulatory element termed the hepatic control region that resides between
APOC1 and APOE*3 into male pronuclei of fertilized mouse eggs. The source of eggs was
superovulated (C57Bl/6J x CBA/J) F1 females. Transgenic founder mice were further bred with
C57Bl/6J mice to establish transgenic strains. Transgenic and non-transgenic littermates of
F21-F22 generations were used in these experiments. All mice were fed a normal chow diet
(SRM-A, Hope Farms, Woerden, The Netherlands) and sacrificed at nine weeks, at which time plasma,
urine, and liver tissue samples were taken and frozen in liquid nitrogen. The samples from each
individual were then subdivided for separate gene expression, protein, and metabolite analyses.
The results of combined mRNA expression, soluble protein, and lipid differential profiling
analyses applied to liver tissue, plasma, and urine taken from wild type and APOE*3-Leiden
mice that were fed a normal chow diet and sacrificed at 9 weeks of age are presented below.
Wildtype mice are used as a tool to compare the characteristics of the transgenic mice, or in other
words, as control mice.

With reference to Figures 11-13, the biological condition 1105, 1205, 1305 to be
investigated is lipid metabolism in a transgenic mammalian system, specifically atherosclerosis
and hyperlipidemia in an APOE*3-Leiden transgenic mouse. The samples collected 1110, 1210,
1310 were from liver tissue, plasma, and urine taken from the transgenic mice.

Liver gene expression. Referring to Figure 11, total RNA was extracted from
homogenized liver tissues using commercially bought RNeasy kits (Qiagen, Germantown,
Maryland). mRNA was then extracted 1115 from the total RNA preparations using a
commercially bought Oligotex kit (Qiagen, Germantown, Maryland). Gene expression
microarray data were acquired using the Mouse UniGene 1 spotted cDNA array (Incyte
Genomics, St. Louis, Missouri). In one embodiment, an analysis of variance (ANOVA) model
was selected for the design of the sample pairings that optimally reduces variation inherent in the
technique.

An mRNA abundance experiment 1120 was performed on the liver tissue. In one
embodiment, the experiment includes mRNA hybridization. Serial analysis of gene expression
and/or pattern recognition may be performed. In one embodiment, a PARC pattern recognition
program is used. Figure 15 illustrates an mRNA abundance experiment. In particular, a gene
expression analysis is illustrated by a mouse liver mRNA expression ratio plot for APOE*3
transgenic mice versus wildtype mice. Examples of gene expression data sets 1125 include not
only the liver gene expression analysis illustrated in Figure 15, but also the gene expression data
illustrated in Figure 16 and the gene expression abundance results illustrated in Figure 17.

Profiling of proteins extracted from the liver and plasma. Proteins were extracted
1215 from frozen liver tissue and plasma samples 1210. Chromatography steps 1220 may be
utilized to further characterize the sample. In one embodiment, the proteins are chemically
modified 1225 following the chromatography step 1220. In another embodiment, the proteins
are fragmented into peptides 1230 following either the chromatography steps 1220 or the
chemical modification step 1225. In one embodiment, fragmentation 1230 is performed by
partial hydrolysis of the proteins. A second chromatography step 1235 may follow the
fragmentation step 1230, and a mass spectrometry step 1240 may follow the chromatography
step 1235. In one embodiment, a PARC pattern recognition program is used to quantify the
proteins. A GIST isotopic labeling method may also be utilized. Identification of the proteins
may be performed with either mass spectrometry or BioSystematics.

Examples of protein-derived data sets 1245 are shown in Figures 18-20. Figure 18 
illustrates intensity plots of LC/MS total ion chromatograms (TICs) of plasma from APOE*3 
transgenic mice vs. wildtype mice. In Figure 19, TICs from LC/MS profiling, which can 
elucidate subtle detectable differences, are shown. Both Figures 18 and 19 illustrate the 
complexity of a data set 1245, as they comprise more than 1000 peptide peaks. Figure 20 
illustrates LC/MS chromatograms acquired from the digested liver proteins of five transgenic 
mice and five wildtype mice. In one embodiment, LC/MS is performed using an LCQ DecaXP 
(ThermoFinnigan, San Jose, CA) quadrupole ion trap mass spectrometer system equipped with 
an electrospray ionization (ESI) probe. 

Profiling of metabolites extracted from urine and plasma. Metabolites were extracted 
from the urine and plasma samples 1310. The urine samples were profiled using one 
dimensional ¹H NMR 1315. NMR spectra are one example of a data set 1340. A data set 1340 
also may be generated from the plasma data by a chromatography step 1320, followed 
by a chemical modification of the metabolites 1325. The modified metabolites 1325 may be 
characterized by a series of chromatography 1330 and mass spectrometry 1335 steps to generate 
a data set 1340. In one embodiment, the plasma samples are ionized by ESI and characterized 
using LC/MS. 

Examples of metabolite data sets 1340 are shown in Figures 21 and 22. Figure 21 
illustrates ¹H NMR spectra of metabolites extracted from plasma for APOE*3 and wildtype 
mice. After referring to the -CH3 signal of MeOD (δ = 3.30), line listings were prepared using 
the standard Varian NMR software. To obtain these listings, all resonances in the spectra above 
a threshold corresponding to about three times the signal-to-noise ratio were collected and 
converted to a data file format suitable for statistical analysis applications. Figure 22 illustrates 
mass chromatograms of plasma lipids recorded using LC/MS for APOE*3 and wildtype mice. 

Combining Data Sets. Referring back to Figures 11-13, in one embodiment, the gene 
1125, protein 1245, and metabolite 1340 data sets are analyzed in parallel to determine 
molecular functions and elucidate cellular mechanisms. A number of bioinformatics tools can be 
utilized to link gene response, protein activity, and metabolite dynamics. The data sets 1125, 
1245, 1340 are subjected to a data preprocessing step 1130, 1250, 1345 (or 210 referring to 
Figure 2). An IMPRESS algorithm may be used to reduce background noise in both LC/MS 
chromatograms and NMR spectra. In another embodiment, [illegible] 
generate IQ files for input into the PARC algorithm. 

In one embodiment, the preprocessed data 1130, 1250, 1345 is 
treated with a statistical analysis step 1135, 1255, 1350. Suitable forms of statistical analyses are 
described in more detail above. The preprocessed data may be normalized using an ANOVA 
algorithm. In another embodiment, normalization occurs after the statistical analysis step, which 
may be performed on the data sets using the PARC algorithm. In one embodiment, 
differentiating spectral components are identified in the factor spectra generated by the statistical 
analysis. 

Figure 23 depicts spectra treated by the normalization step 215. Individual gene, protein, 
and metabolite spectra are normalized using the model described above, and then the individual 
normalized spectra are concatenated into a single factor spectrum. In Figure 23, the data 
were measured on a biological sample extracted from mouse liver. Using the concatenated spectrum, 
direct comparison across biomolecular component types may be performed. 
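
The normalize-then-concatenate idea can be sketched in a few lines of Python. This is an illustration only and not the ANOVA-based normalization referred to above: each block is simply centered and scaled to unit variance, which is an assumption made here for brevity, and the function names and intensity values are hypothetical.

```python
from statistics import mean, pstdev

def normalize(block):
    """Center a block of intensities and scale it to unit variance.
    Stand-in for the model-based normalization (an assumption of this sketch)."""
    m, s = mean(block), pstdev(block)
    if s == 0:
        return [0.0 for _ in block]
    return [(x - m) / s for x in block]

def concatenate_spectra(gene, protein, metabolite):
    """Normalize each component type separately, then join the results
    into one vector so values are comparable across component types."""
    return normalize(gene) + normalize(protein) + normalize(metabolite)

# Hypothetical intensities recorded on very different scales.
gene = [1200.0, 1500.0, 900.0]
protein = [0.8, 1.1, 0.95, 1.3]
metabolite = [30.0, 45.0]

combined = concatenate_spectra(gene, protein, metabolite)
print(len(combined), [round(v, 2) for v in combined])
```

Normalizing each block before joining keeps one high-intensity platform from dominating the combined vector.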

Figures 24-25 provide an illustrative embodiment of the statistical analysis step 1135, 
1255, 1350 and the subsequent inspection step 1140, 1260, 1355. For the sake of simplicity, 
only the protein plasma analysis is presented, but the method can be extended to both genes and 
metabolites. Figure 24 illustrates clustering of wildtype mouse data and APOE*3 transgenic 
mouse data performed using a PC-DA 1255 on the peptide ion mass data. An inspection 1260 of 
the two distinct clusters shown in Figure 24 reveals that the masses of the ions differentiate the 
two clusters. Figure 25 shows the masses of the peptide ions exhibiting significant differences 
plotted in a difference factor spectrum. In one embodiment, a t-test is applied to each of the 
differentiating ions to test their significance. In another embodiment, loading plots are used 
instead of factor spectra. 
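
A per-ion t-test of the kind mentioned above might look like the following Python sketch. The text does not say which t-test variant is used; Welch's unequal-variance form is assumed here, and the m/z values and intensities are hypothetical. Only the t statistic and degrees of freedom are computed; converting them to a p-value requires a t-distribution table or a statistics library.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, degrees of freedom) for Welch's unequal-variance t-test."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical ion intensities: m/z -> (transgenic animals, wild type animals).
ions = {
    446: ([10.2, 11.0, 10.8, 11.4, 10.6], [7.1, 7.5, 6.9, 7.8, 7.2]),
    812: ([5.0, 5.3, 4.8, 5.1, 4.9], [5.1, 4.9, 5.2, 5.0, 4.8]),
}
for mz, (tg, wt) in ions.items():
    t, df = welch_t(tg, wt)
    print(f"m/z {mz}: t = {t:.2f}, df = {df:.1f}")
```

A large absolute t marks an ion whose group means differ by much more than the within-group scatter.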

An additional mass spectroscopy analysis step 1265, 1360 may be performed to analyze 
further the proteins, peptides, or metabolites that exhibit a change above a threshold abundance 
level. In one embodiment, MS/MS is used to analyze and identify the proteins, peptides, or 
metabolites. In another embodiment, genes, proteins, peptides, or metabolites that exhibit a 
statistically significant change are identified during the manual inspection step 1140, 1260, 1355. 
Subsequent to identifying all genes, proteins, peptides, and metabolites 1145, 1270, 1365, a list 
of those genes, proteins, peptides, and metabolites is extracted and stored 1150, 1275, 1370 for 
future comparison. 

Figure 26 depicts an MS/MS spectrum of the peptides generated by hydrolysis of the 
proteins extracted from mouse plasma, which corresponds to step 1265 in Figure 12. Those 
peptide fragments, which are labeled b7-b17 and y5-y16, are compared to a database, so that the 
protein which was fragmented can be identified and sequenced, which corresponds to the 
identification step 1270 in Figure 12. In this particular case, the protein identified is human 
ApoE3, which is the protein introduced by the transgenic manipulation. 

Table I lists the key differentially expressed components extracted from the lists of genes, 
proteins, and metabolites. This list was generated in accord with steps 1150, 1275, 1370, which 
are illustrated in Figures 11-13. The extracted list of components also corresponds to the extract 
list of components step 220 in Figure 2. 



Table I. Key differentially expressed biomolecular components (excluding human ApoE3). 

Biomolecular      Component   Name                                   Fold Ratio
component type    ID                                                 (APOE3:WT)

Gene              G7801       Heat shock 70 kD protein               3.10
Gene              G562        RIKEN cDNA 3230402M22                  2.72
Metabolite        M1          Triglycerides                          2.59
Metabolite        M7          DAG C18, 20:1                          1.92
Metabolite        M9          LysoPC C16:0                           1.68
Gene              G7485       Apoptosis inhibitory 6                 1.51
Protein           P1059       FABP (fatty acid binding protein)      1.36
Gene              G1615       Heterogeneous nuclear RNP H1           1.35
Gene              G693        FABP (fatty acid binding mRNA)         1.33
Gene              G1032       Translation Initiation Factor 2        1.14
Metabolite        M3          PC C20, 20:8                           0.94
Gene              G8147       Apolipoprotein A1                      0.76
Protein           P744        Protein Kinase C, epsilon              0.74
Protein           P451        ATP-binding cassette (ALD), mem1       0.72
Protein           P1439       Heme oxygenase-2                       0.64
Protein           P1362       IPF1                                   0.59

In one embodiment, the individual biomolecular components listed in Table I are 
normalized, so a more meaningful comparison across biomolecular component types can be 
performed. In another embodiment, the list of biomolecular components listed in Table I is 
used to produce a correlation network in accord with step 225 in Figure 2 and step 1420 in 
Figure 14. Figure 27 illustrates a correlation network between biomolecular component types. 
The network was produced [illegible] 
associations between individual [illegible] 
then may be compared to existing knowledge from the literature or other public information 
sources, which corresponds to step 230 in Figure 2 or step 1425 in Figure 14. Figure 28 
illustrates a map of the known relations between the correlation network associations and 
published information. 

Referring back to Figure 14, an illustrative embodiment of correlation network 
associations that are analyzed to determine biomarkers or mechanisms of action 1430 is 
depicted. The known relations may be analyzed to determine biomarkers or mechanisms of 
action 1430. In one embodiment, the correlation [illegible] 
associative and causal [illegible] across biomolecular component types 1435. The known 
relations also may [illegible] 
biomolecular component types 1435. 

Returning to Figure 2, in one embodiment, the system is perturbed 235. As stated above, 
the perturbed system then may be used to produce new data sets, new correlation networks, and 
new correlation network associations before deducing the causal mechanisms of the perturbation. 
The perturbations to the system may be iterated until causal relations are determined between 
multiple biomolecular component types. 

From the biomarkers determined from a systems biology analysis as 
described above, markers that differentiate diseased and healthy populations may be derived. 
This information can then be placed in the appropriate biological context to determine, e.g., 
when a marker can be identified as either a causative agent or a downstream product of a 
dysregulated pathway. As described above, comprehensive gene, protein, and metabolite 
profiling, coupled with correlation analysis and network modeling, provides insight into 
biological context, and this level of knowledge may be used to develop therapeutic agents or 
may serve as a basis for directed, hypothesis-driven experiments that are designed to further 
elucidate pathophysiologic mechanisms. 

Figure 29 illustrates typical "offerings" or "deliverables," in terms of biomarkers or 
therapeutic agents that can be derived from a systems biology analysis. Described below are two 
examples that illustrate not only typical systems biology analyses, but also a more detailed 
description of how the information derived from these systems biology analyses is employed to 
determine not only which therapeutic agents should be used, but also which pathophysiologic 
mechanisms require further study. 

Examples. Systems Biology Analysis of the APOE*3-Leiden Transgenic Mouse 

The results of combined mRNA expression, soluble protein, and lipid differential 
profiling analyses applied to liver tissue, plasma, and urine taken from wild type and 
APOE*3-Leiden mice that were fed a normal chow diet and sacrificed at 9 weeks of age are 
presented below. Results from each biomolecular component type class analysis reveal the 
presence of early markers of predisposition to disease. In addition, results of a correlation 
analysis are suggestive of networks of molecules, spanning genes, proteins, and lipids, that 
undergo concerted change. 

Animals. APOE*3-Leiden transgenic mouse strains were generated by microinjecting a 
twenty-seven kilobase genomic DNA construct containing the human APOE*3-Leiden gene, the 
APOC1 gene, and a regulatory element termed the hepatic control region that resides between 
APOC1 and APOE*3 into male pronuclei of fertilized mouse eggs. The source of eggs was 
superovulated (C57Bl/6J x CBA/J) F1 females. Transgenic founder mice were further bred with 
C57Bl/6J mice to establish transgenic strains. Transgenic and non-transgenic littermates of 
F21-F22 generations were used in these experiments. All mice were fed a normal chow diet 
(SRM-A, Hope Farms, Woerden, The Netherlands) and sacrificed at nine weeks, at which time 
plasma, urine, and liver tissue samples were taken and frozen in liquid nitrogen. The samples 
from each individual were then subdivided for separate gene expression, protein, and metabolite 
analyses. 

Liver gene expression. Total RNA was extracted from homogenized liver tissues 
using commercially bought RNAeasy kits (Qiagen, Germantown, Maryland). mRNA was then 
extracted from the total RNA preparations using a commercially bought Oligotex kit (Qiagen, 
Germantown, Maryland). Gene expression microarray data were acquired using the Mouse 
UniGene 1 spotted cDNA array (Incyte Genomics, St. Louis, Missouri). An analysis of variance 
(ANOVA) model was selected for the design of the sample pairings that optimally reduces 
variation inherent in the technique. 

Liver protein profiling. Frozen liver tissues were powdered in a pre-chilled mortar that 
was kept cold with the addition of liquid nitrogen. T-PER protein extraction reagent (Pierce 
Chemical Co., Rockford, Illinois) was then added at 8 µL/mg of tissue, and the sample was 
further homogenized by sonication. Samples were then centrifuged at 10,000 x g for 5 minutes, 
and the supernatants collected. Relative total protein concentrations were determined from 
integrated whole chromatograms of aliquots that had been injected into a size exclusion 
chromatography system consisting of a Super SW3000 TSKgel column (Tosoh Biosep, Tokyo) 
and [illegible]. To reduce sample complexity, the protein [illegible] reversed-phase 
chromatography on a VISION Workstation [illegible] column (4.6 x 100 mm; Applied 
Biosystems, Foster City, California) that was eluted with a water/acetonitrile (MeCN) 
[illegible] in the presence of 0.1% trifluoroacetic acid (TFA). Proteins were digested 
[illegible] mM [illegible] chloride, [illegible] mM dithiothreitol [illegible] iodoacetamide 
at [illegible] for 30 minutes, and [illegible] hours at 37°C. 

[illegible] protein LC/MS [illegible] was performed using an LCQ DecaXP (ThermoFinnigan, 
San Jose, CA) quadrupole ion trap mass spectrometer system equipped with an electrospray 
ionization (ESI) probe. [illegible] consisted of a quaternary gradient pump [illegible] San Jose, 
CA). Samples were suspended [illegible] column (150 x 1 mm, 5 µm) (GraceVydac, Hesperia, 
CA). The column was eluted at 50 µL/minute isocratically for two minutes with Solvent A 
(water/MeCN/acetic acid/TFA, 95/4.95/0.04/0.01, vol/vol/vol/vol) followed by a linear 
gradient over 43 minutes to 75% Solvent B (water/MeCN/acetic acid/TFA, 20/79.95/0.04/0.01, 
vol/vol/vol/vol). The electrospray ionization voltage was set to 4.25 kV and the heated transfer 
capillary to 200°C. Nitrogen sheath and auxiliary gas settings were 25 and 3 units, respectively. 
For quantification of tryptic peptides, the scan cycle consisted of a single full scan mass 
spectrum acquired over m/z 400-2000 in the positive ion mode. Data-dependent product ion 
mass spectra (MS/MS) were also acquired for [illegible] identification using the TurboSEQUEST 
algorithm (ThermoFinnigan, San Jose, CA). 

Liver lipid profiling. Liver tissue was freeze-dried, pulverized, and then extracted with 
20 µL isopropanol per mg of tissue in an ultrasonic bath for 2 hours. The samples were then 
centrifuged and the supernatants collected. Samples were then diluted with 4 volumes of water 
and taken for LC/MS analysis. LC/MS data were acquired using an LCQ (ThermoFinnigan, San 
Jose, California) quadrupole ion trap mass spectrometer equipped with an electrospray ionization 
probe. The LC component consisted of a Waters 717 series autosampler and a 600 series single 
gradient forming pump (Waters, Milford, Massachusetts). Samples were injected in duplicate in 
random order onto an Inertsil column (ODS 3, 5 µm, 100 x 3 mm) protected by an R2 guard 
column (Chrompack). Three mobile phases were used in the elution: (1) 
(water/MeCN/ammonium acetate/formic acid, 93.9/5/1/0.1, vol/vol/vol/vol), (2) (acetonitrile/ 
isopropanol/ammonium acetate/formic acid, 68.9/30/1/0.1, vol/vol/vol/vol), and (3) 
(isopropanol/dichloromethane/ammonium acetate/formic acid, 48.9/50/1/0.1, vol/vol/vol/vol). 
The column was eluted at 0.7 mL/minute using a two-step gradient: Step (1) from 0 to 15 
minutes beginning with 70% A, 30% B, 0% C and ending with 5% A, 95% B, and 0% C, and 
Step (2) a 20 minute gradient with no change in A, 95% to 35% B, and 0% to 60% C. The 
electrospray ionization voltage was set to 4.0 kV and the heated transfer capillary to 250°C. 
Nitrogen sheath and auxiliary gas settings were 70 and 15 units, respectively. For quantification 
of metabolites, the scan cycle consisted of a single full scan (1 s/scan) mass spectrum acquired 
over m/z 250-1200 in the positive ion mode. 
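
The two-step gradient described above can be written down as a small table of breakpoints and checked numerically. The Python sketch below assumes that Step (2) starts at 15 minutes and ends at 35 minutes and that each ramp is linear; neither the function name nor those timing assumptions come from the text.

```python
# Breakpoints of the two-step gradient: (time in minutes, %A, %B, %C).
# Step 2 is assumed to run from 15 to 35 minutes (a 20 minute gradient).
GRADIENT = [
    (0.0, 70.0, 30.0, 0.0),
    (15.0, 5.0, 95.0, 0.0),
    (35.0, 5.0, 35.0, 60.0),
]

def composition(t):
    """Return (%A, %B, %C) at time t by linear interpolation between breakpoints."""
    if t <= GRADIENT[0][0]:
        return GRADIENT[0][1:]
    for (t0, *c0), (t1, *c1) in zip(GRADIENT, GRADIENT[1:]):
        if t <= t1:
            f = (t - t0) / (t1 - t0)
            return tuple(x0 + f * (x1 - x0) for x0, x1 in zip(c0, c1))
    return GRADIENT[-1][1:]

for t in (0, 7.5, 15, 25, 35):
    a, b, c = composition(t)
    print(f"{t:5.1f} min: A={a:.1f}% B={b:.1f}% C={c:.1f}%")
```

Because every breakpoint sums to 100%, every interpolated composition does as well, which is a quick sanity check on the stated percentages.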

LC/MS data pre-processing. LC/MS data sets were converted into ANDI (.cdf) format 
using the File Converter functionality built into the Xcalibur instrument control software 
(ThermoFinnigan, San Jose, California). The IMPRESS algorithm (TNO Pharma, Zeist, The 
Netherlands) was then applied to the converted files for automated peak detection and peak data 
quality assessment. The program evaluates each mass trace for its chromatographic quality by 
assessing its information content. The LC/MS chromatograms at each mass to charge ratio were 
smoothed to remove noise spikes, and then the entropy of each trace was calculated using 
Equation 12. Taking the reciprocal value of H and scaling all results to the largest value gave 
each mass trace a scaled chromatographic quality number called the Impress Quality (IQ): 

H = -\sum_{i=1}^{N} p_i \log(p_i).    (12) 

An IQ threshold was then selected, and if the IQ of a peak was below the threshold, the peak was 
deemed to be of poor quality and was not taken forward to clustering analyses described below. 
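
The entropy-based quality score can be illustrated with a short Python sketch. It is not the IMPRESS implementation: p_i is assumed here to be the fraction of a trace's total intensity found at scan i, the smoothing step is omitted, the function names are hypothetical, and the example traces are invented.

```python
from math import log

def trace_entropy(trace):
    """Entropy H = -sum(p_i * log(p_i)) of one mass trace, with p_i taken as the
    fraction of total intensity at scan i (an assumed definition)."""
    total = sum(trace)
    probs = [x / total for x in trace if x > 0]
    return -sum(p * log(p) for p in probs)

def impress_quality(traces):
    """Map each m/z to 1/H scaled by the largest 1/H, so the best trace scores 1.
    A trace with a single non-zero scan has H = 0 and would need special handling."""
    inv = {mz: 1.0 / trace_entropy(tr) for mz, tr in traces.items()}
    top = max(inv.values())
    return {mz: v / top for mz, v in inv.items()}

# Hypothetical traces: a sharp peak, a broad peak, and flat noise.
traces = {
    446: [0, 1, 2, 50, 200, 50, 2, 1, 0, 0],
    599: [5, 10, 30, 60, 80, 60, 30, 10, 5, 2],
    700: [10, 11, 9, 10, 12, 10, 9, 11, 10, 10],
}
iq = impress_quality(traces)
threshold = 0.5  # illustrative cut-off
kept = [mz for mz, q in iq.items() if q >= threshold]
print({mz: round(q, 2) for mz, q in iq.items()}, kept)
```

A flat, noise-like trace spreads its intensity evenly, which maximizes H and therefore minimizes its quality score, while a sharp chromatographic peak concentrates intensity in a few scans and scores highest.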

Normalization of microarray data. As described above, the data may be represented 
by the following model: 

where the gene and variety effects are described by μ, the array effect by A, the dye effect by 
[illegible], and the error by [illegible]. The error is normally distributed with zero mean, and the 
variance, σ², is not permitted to be different for each gene and variety. The optimal parameters 
of the model are calculated using a maximum likelihood estimator. For each particular array and 
dye, the samples are then scaled as: 

Statistical tests of significance. To estimate the statistical significance of differences in 
mean normalized intensities from transgenic and wild type samples, a t-test was applied for each 
of the N genes, and the corresponding p-values were calculated. When assessing the statistical 
significance of fold change for each gene, a total of N p-values were collected, so several 
p-values with p < 0.05 were expected by chance. To account for this, the overall likelihood, 
P(p), of observing a p-value ≤ p for any of the N genes was used. Assuming independence of 
all genes, the overall likelihood was estimated with: 

P(p) = 1 - (1 - p)^{N}.    (13) 
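
The overall likelihood is straightforward to evaluate numerically. In the Python sketch below, the function name is hypothetical and the value of N is an assumption chosen for illustration, since the number of genes on the array is not stated here.

```python
def overall_likelihood(p, n_genes):
    """P(p) = 1 - (1 - p)**N: chance that at least one of N independent
    tests yields a p-value <= p when no gene is truly different."""
    return 1.0 - (1.0 - p) ** n_genes

# N is hypothetical; the text does not give the number of genes tested.
N = 9000
for p in (0.05, 0.001, 1e-6):
    print(f"p = {p:g}: P(p) = {overall_likelihood(p, N):.4f}")
```

With thousands of genes, a per-gene p-value of 0.05 is almost certain to occur by chance somewhere on the array, which is why far fewer genes survive the overall likelihood assessment than the per-gene test.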

PCDA analysis and correlation plots. Principal component and discriminant analyses 
(PCDA) were applied to the tryptic peptide and lipid LC/MS profiles that had been 
pre-processed with the IMPRESS algorithm as described above. This was done using WINLIN 
statistical software (TNO Pharma, Zeist, The Netherlands). 

Microarray analysis of liver gene expression. Mouse liver mRNA samples were paired 
for hybridization on the UniGene 1 cDNA spotted microarrays following the "loop design" 
shown in Figure 30A. This method of pairing was based on an ANOVA model that was 
designed to provide a basis for optimal normalization of gene expression data and to minimize 
the contribution of variability that might arise from factors such as unequal rates of 
hybridization between nucleic acids or dye effects. mRNA samples were labeled with Cy3 and 
Cy5 for dual hybridization, as shown. 
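
A generic loop design can be sketched as follows. This Python example does not reproduce the specific layout of Figure 30A; it only illustrates the general idea that each sample is hybridized on two arrays, once with each dye. The function name and sample names are hypothetical.

```python
def loop_design(samples):
    """Pair consecutive samples around a closed loop: array k carries
    sample k labeled with Cy3 and the next sample labeled with Cy5."""
    n = len(samples)
    return [
        {"array": k + 1, "Cy3": samples[k], "Cy5": samples[(k + 1) % n]}
        for k in range(n)
    ]

# Hypothetical sample names alternating transgenic (TG) and wild type (WT).
samples = ["TG1", "WT1", "TG2", "WT2", "TG3", "WT3"]
for row in loop_design(samples):
    print(row)
```

Because every sample appears once in each dye channel, dye effects and array effects can be separated from the genotype effect when the model is fitted.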

As evidenced by the cDNA microarray data scatter plot shown in Figure 30B, relatively 
few genes were differentially expressed at the 95% confidence level. Values were plotted as 
mean values of expression in wild type and APOE*3-Leiden transgenic mice, and data points 
were color-coded on the basis of statistical significance. Far fewer met a more rigorous overall 
likelihood, P(p), assessment that attempts to rule out chance events where data may randomly, 
but falsely, have p-values < 0.05. 

Table II lists a sample set of genes where the fold-ratio between transgenic and wild type 
control was either less than 0.8 or greater than 1.2. The relatively low p-values that were 
observed despite the rather narrow margins of difference in expression reflect the statistical 
advantages of the ANOVA model. Of note are the lower levels of expression of apolipoprotein 
AI and an analog of apolipoprotein B in the transgenic animals, while an analog of 
apolipoprotein F was higher. Interestingly, prior analysis of plasma obtained from the APOE*3-
Leiden mice revealed an approximately two-fold down regulation at the protein level. In 
addition, peroxisomal proliferator-activated receptor-alpha (PPARα) expression was not 
different between the two populations, while liver fatty acid binding protein (L-FABP) was 43% 
higher in the transgenics. PPARα plays a key role in initiating gene expression of proteins 
involved in lipid metabolism, while experimental evidence suggests that L-FABP may control 
the activity of the transcription factor by controlling the rate of presentation of activating ligand. 
The lipid profiling analysis shows that lipid metabolism is indeed impacted by the presence of 
the transgene, and in the absence of change in PPARα levels, these data support a regulatory role 
for L-FABP. 

Table II. Liver mRNA expression. 

Description                                                       Ratio        p-value

claudin 4                                                         0.59         0.001
CD8beta opposite strand                                           0.69         0.003
iroquois related homeobox 3 (Drosophila)                          0.72         0.001
cysteine rich protein                                             0.74         0.006
Apolipoprotein A-I                                                0.75         0.009
fatty acid binding protein 5, epidermal                           0.75         0.044
ESTs, Moderately similar to 156333 apolipoprotein B - rat         0.75         0.043
pterins                                                           0.77         0.019
nitric oxide synthase 3, endothelial cell                         0.81         0.018
ornithine aminotransferase                                        1.22         0.016
glutathione S-transferase, alpha 1 (Ya)                           1.28         0.029
malate dehydrogenase, mitochondrial                               1.28         0.002
extracellular proteinase [illegible]                              1.28         0.027
[illegible] antigen                                               1.28         0.037
ESTs, Weakly similar to apolipoprotein F [H. sapiens]             1.28         0.028
receptor (calcitonin) activity [illegible]                        1.29         0.032
cytochrome c oxidase, subunit VIIc                                1.29         0.040
eosinophil [illegible]                                            1.31         0.013
cytochrome c oxidase, subunit VIIa 3                              1.32         0.044
histidine triad nucleotide-binding protein                        1.33         0.031
malate dehydrogenase, soluble                                     1.33         0.023
M. musculus H2B gene                                              1.34         0.021
ATPase, H+ transporting lysosomal (vacuolar proton pump)          1.34         0.048
ATP synthase, H+ transporting, mitochondrial F0 complex           1.39         0.018
thymosin, beta 4, X chromosome                                    1.40         0.024
ganglioside-induced differentiation-[illegible]                   1.40         0.012
solute carrier family 35 (UDP-galactose transporter), member 2    1.42         0.024
glucose regulated protein, 58 kDa                                 1.43         0.021
spermidine/spermine N1-acetyl transferase                         1.43         0.030
fatty acid binding protein 1, liver                               1.43         0.024
signal recognition particle 9 kDa                                 1.45         0.034
orosomucoid 2                                                     1.46         0.020
cathepsin S                                                       1.48         0.033
Lysozyme                                                          1.49         0.007
nucleobindin 2                                                    [illegible]  [illegible]
orosomucoid 1                                                     1.51         0.009
serum amyloid A 3                                                 1.51         0.001
major urinary protein 1                                           1.56         0.005
DnaJ (Hsp40) homolog, subfamily C, member 3                       1.58         0.012
SEC61, gamma subunit (S. cerevisiae)                              1.60         0.008
calcium binding protein A11 (calgizzarin)                         1.70         0.004
tumor rejection antigen gp96                                      2.01         0.003
proteoglycan, secretory granule                                   2.45         0.001
heat shock 70kD protein 5 (glucose-regulated protein, 78kD)       2.93         0.001

Quantitative profiling of liver proteins. Off-line reversed phase separation of soluble liver 
proteins to decrease the sample complexity by approximately a factor of 20 was initially 
employed. An ESI-LC configuration was coupled to the mass spectrometer that was capable of 
handling hundreds of consecutive injections. Next, data were acquired using an MS-only scan 
cycle, without acquisition of sequencing MS/MS scans, to reduce cycle time and minimize the 
loss of information that occurs while the column elutes between scans. 

As shown in Figure 31A, LC/MS chromatograms were acquired for digested liver protein 
fractions from five APOE*3-Leiden and five wild type mice. The IMPRESS algorithm was then 
applied to each data set to extract peak intensity and signal quality information. An IMPRESS 
quality value of 0.5 was selected as the threshold below which poor quality signal data would be 
excluded from further analysis. Clustering was then performed using the principal 
component-discriminant analysis (PCDA) tool built into the WINLIN software. As shown in 
Figure 31B, two distinct clusters were observed, with transgenic mice in one and wild type mice 
in the other. An inspection of the factor spectrum, illustrated in Figure 31C, provided masses of 
the ions that differentiated the two clusters. A t-test was applied to each of the differentiating 
ions to test significance, and an LC/MS/MS spectrum was acquired for each peptide. Six tryptic 
peptides that were each derived from a digestion of L-FABP, with mass to charge ratios 446, 
599, 706, 892, 895, and 1058, are labeled in Figure 31C. Since the factor spectrum is 
semi-quantitative in nature, peak intensity information gathered by IMPRESS was used to 
calculate relative differences. The results of this profiling analysis indicated that L-FABP was 
up-regulated by 44% in transgenic mice relative to wild type controls. This was essentially a 
one-to-one correlation with the mRNA expression observation noted above. Table III 
summarizes the results of the protein analysis. 
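
A relative difference of this kind can be computed from group-mean peak intensities. The Python sketch below assumes the relative difference is the ratio of mean transgenic intensity to mean wild type intensity; the function name is hypothetical and the intensities are invented values chosen only to show the arithmetic behind a 44% figure, not data from the study.

```python
from statistics import mean

def percent_change(transgenic, wild_type):
    """Relative difference of mean peak intensity, as a percentage of wild type."""
    return 100.0 * (mean(transgenic) / mean(wild_type) - 1.0)

# Hypothetical peak intensities for one peptide in five animals per group.
tg = [1.50, 1.38, 1.46, 1.41, 1.45]
wt = [1.00, 0.96, 1.04, 1.02, 0.98]
print(f"{percent_change(tg, wt):.0f}% higher in transgenic mice")
```

A TG/WT ratio of 1.44 in a table corresponds to a 44% increase under this definition.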

Table III. Liver protein expression. 

Description 

Apoptosis protein MA-3 
Asialoglycoprotein Receptor 1 
ATP-Binding Cassette, Sub-Family Member 1 
Beta-Crystallin B2 
[illegible] Mouse Ephrin [illegible] (Eph-Related Receptor Tyrosine Kinase Ligand 3) 
Liver Fatty Acid Binding Protein 
[illegible] 
Glutathione [illegible] 
Guan[illegible] Chains [illegible] Long Form 
[illegible] 
He[illegible] 
Hemopoietic C[illegible] Tyrosine Phosphatase 
[illegible] 
[illegible] 
Lym[illegible] 
Md6 Protein 
Mouse [illegible] 
Nodal Mouse Nodal Precursor 
Nu[illegible] 
Probable E1-E2 ATPase - Mouse (Fragment) 
Procollagen, Type V, Alpha 2 
Protein Kinase C, Epsilon 
Pyruvate Kinase 
Ubiquitin-Protein Ligase E3a 

Ratio TG/WT          p-value

0.85                 0.019
1.39                 0.028
0.72                 0.025
0.76                 0.016
0.52                 0.005
1.44                 0.036
1.24                 0.008
0.6[illegible]       0.015
1.36, 1.59           0.002, 0.014
0.64                 0.014
1.38, 0.48           0.020, [illegible]
0.62                 [illegible], 0.034
1.98                 0.019
0.53, 0.67, 0.85     0.007, 0.035, 0.019
0.52, 1.12, 0.83     0.006, 0.032
1.20, 0.74           0.034, 0.012
1.24                 0.105, 0.008
1.15                 0.079



Quantitative profiling of liver lipids. Lipids were profiled using a strategy similar to that used 
for the protein analysis. Duplicate data sets were acquired [illegible]. The extraction 
protocol and LC system were designed to fractionate larger, non-polar lipids such as 
diacylglycerols (DG) and triacylglycerols (TG). Captured within this acquisition were also 
quantitative profiles of phosphatidylcholine (PC) and lysophosphatidylcholine (LysoPC) lipids. 
Following data pre-processing with IMPRESS, PCDA 
analysis was performed using WINLIN. As shown in Figure 32A, the two populations of mice 
formed two distinct clusters. The PCDA factor spectrum, illustrated in Figure 32B, indicates that 
a number of lipids contribute to the difference between the two populations. Mass to charge 
ratio ranges that include the majority of lysophosphatidylcholines (LysoPC), diacylglycerols 
(DG), phosphatidylcholines (PC), and triacylglycerols (TG) are indicated. 

As summarized in Table IV, a number of triacylglycerols were higher in the transgenic 
mice, while none were found to be in lower abundance. Similarly, two 
lysophosphatidylcholines, 1-Palmitoyl-2-Hydroxy-sn-Glycero-3-Phosphocholine (LysoPC C16:0) 
and 1-Stearoyl-2-Hydroxy-sn-Glycero-3-Phosphocholine (LysoPC C18:0), were found at higher 
levels in the APOE*3-Leiden mice, while there were no significant differences observed for 
other LysoPCs. Interestingly, among the diacylglycerol and phosphatidylcholine sub-classes, an 
overall trend toward higher abundance in the transgenic animals was not observed, suggesting 
that the disruption of lipid metabolism imposed by insertion of the transgene leads to a complex, 
multifactorial change in the regulation of lipid levels. 

Table IV. Liver lipid: Fold difference between APOE*3-Leiden transgenic mice and the 
wildtype control mice. 

Description                 Species       Ratio TG/WT       p-value

Lysophosphatidylcholine     C16:0         1.31              0.0190
                            C18:0         1.24              0.0241
Diacylglycerol              C18,C20:1     1.43              0.0064
                            C22,20:1      0.78              0.0151
                            C22,22:10     0.80              0.0018
                            C22,22:3      0.77              0.0070
Phosphatidylcholine         C18,18:0      [illegible]       [illegible]
                            C20,18:2      0.79              0.0231
                            C20,20:8      0.7[illegible]    0.0422
                            C20,20:7      0.82              0.0341
                            C20,20:4      1.50              0.0138
                            C20,22:3      2.75              0.0001
                            C20,22:2      1.85              0.0023
                            C20,22:1      1.20              0.0059
                            C22,22:4      2.82              0.0005
                            C22,22:3      1.84              0.0002
                            C22,22:2      1.37              9.4E-06
Triacylglycerol             C50:0         2.02              9.7E-07
                            C56:7         1.87              6.9E-06
                            C56:6         1.96              2.8E-08
                            C56:5         1.60              0.0003
                            C56:4         1.97              0.0000
                            C56:3         1.84              0.0058
                            C56:2         2.15              0.0069
                            C58:10        5.38              0.0004
                            C58:9         2.94              2.05E-06
                            C58:8         2.43              1.13E-07
                            C58:7         1.93              6.78E-10
                            C58:6         2.42              1.40E-09
                            C58:4         2.70              1.62E-05
                            C58:3         2.15              0.0001
                            C58:2         1.37              0.0077


Discussion. As highlighted in Figures 33A-33C, the comprehensive systems analysis 
based on differential genomic, proteomic, and metabolomic profiling yielded a number of novel 
observations that distinguish the APOE*3-Leiden transgenic mouse from wild type controls 
under conditions where the mice display essentially no clinical indications of disease. Following 
PCDA clustering analysis and identification of differentiating factors, the relative abundance of 
each biomolecular component type, mRNA, protein, and lipid, was calculated and is shown in 
Figures 33A, 33B, and 33C, respectively. Values represent the mean ± SEM for n = 4-5 separate 
animals (*p < 0.05). Taken individually, each of these entities may serve as a biomarker of an 
altered metabolic state that predisposes a subject to hyperlipidemia and atherosclerosis. 

Key species in atherosclerosis identified as early markers of disease in the 
APOE*3-Leiden mouse are illustrated in Figure 34. In humans, the APOE*3-Leiden mutation 
gives rise to a dysfunctional apolipoprotein E variant that has reduced affinity for the 
low-density lipoprotein receptor (LDLR). Similarly, APOE*3-Leiden transgenic mice also develop 
hyperlipidemia and are susceptible to diet-induced atherosclerosis. Early markers of pathology 
that were found via systems biology in young mice that were reared on a normal chow diet are 
indicated with arrows (upward pointing denotes up-regulation in the transgenic, while downward 
pointing denotes down-regulation in the transgenic). These markers include Apo AI and 
L-FABP mRNA and protein, and a variety of lipid molecules. For example, lipoprotein-associated 
phospholipase A2 (which is also described as platelet activating factor acetyl hydrolase) is an 
enzyme that catalyzes the generation of LysoPC from PC in circulation and has been identified 
as a risk factor for heart disease. [Packard et al., N. Engl. J. Med. 343, 1148 (2000).] LysoPCs 
contribute to early pro-inflammatory events in pathogenesis, where they increase 
monocyte adhesion and chemotaxis during fatty streak development. In the present study, two 
LysoPC compounds that are elevated in the livers of APOE*3-Leiden transgenic mice were 
identified, suggesting that early inflammatory events in the liver may play a role in the 
pathogenesis of atherosclerosis. 

The apolipoproteins and L-FABP constitute a second macromolecular group of
biomarkers. Apolipoprotein AI (ApoAI) is significantly lower in the plasma of APOE*3-Leiden
mice compared to wild type controls. Here, mRNA transcripts for this apolipoprotein were
found to be lower in the liver, bolstering the previous observation and therefore supporting a role
for lowered ApoAI and HDL levels as contributing factors to predisposition to disease.

Evidence for elevated L-FABP was also provided by both genomic and proteomic
analyses. ApoE-deficient mice that were also deficient for adipocyte fatty acid binding protein,
aP2, were protected against atherosclerosis via a mechanism involving impaired macrophage
function. [Makowski et al., Nat. Med. 7, 699 (2001).] L-FABP is a member of the same family of
intracellular fatty acid binding proteins. It is believed to play a role in transcriptional regulation
by acting as a shuttle for ligands of PPARα. [Wolfrum et al., Proc. Natl. Acad. Sci. USA 98,
2323 (2001).] In humans, ApoAI expression is transcriptionally regulated by PPARα. Of
particular interest, the results of the present study show an uncoupling of the relationship
between L-FABP and PPARα-mediated ApoAI expression, since L-FABP levels were elevated,
PPARα levels were unchanged, and ApoAI expression was lowered. These results therefore
suggest that an additional, but essential, factor is absent or down-regulated. It is intriguing to
speculate that this factor might be a particular ligand for PPARα.

In conclusion, we have shown that a systems biology approach of profiling at
the mRNA, protein, and lipid levels has uncovered a number of novel biomarkers for early
preowppsifaWofAP^^ Taken. 
qpUectiyely,^ 
.llk-^geaterpr^ 

emiedtheel^i^onof;UTter^ 

hasprovided;^ of disease as Well as avenues for therapeutic 

intervention. 

Exmpte j.: Systems biology approach: mj^araUelm 
15 iremsgemc mo^e m^del ! 

tteresulteiofasy^ 

j manmiaUanhypertipi A platform 

int^^mg prpfeornic and mc^dlomic analyses and quantitative differentiatihg dis 
underlying a transgenic system are described. To gain insight into a multifactorial disease such
as hyperlipidemia and atherosclerosis, a systems biology approach to profile protein and
metaboUtecbnstimehtsin whole.plasma of ApbE*3iUiden:n^ The 
results cbnfran Mown lipid metabolism processes, and^eluda^te novel differed at the 
lipoprotein .and hpidleyelsm 

The overall approach to systems analysis, a whole plasma parallel proteo-metabolic
profiling scheme, applied in this study is schematically outlined in Figure 35. Whole plasma,
lipid, and protein fractions from ApoE*3-Leiden and control mice were analyzed by NMR and
MS. Both metabolic and protein data sets were filtered through the IMPRESS algorithm and
clustered simultaneously using WINLIN statistical software as described in the text. Separation
and spectroscopic analytical methods, such as HPLC, NMR and LC/MS, were combined with
• 30 powerful statistical par^ 

cluster and identify biochemical constituents in plasma of control vs. genetically perturbed
animals. The results show major (> 2-fold) and less obvious, but statistically significant (p <
0.05, t-test) differences at the protein and metabolite levels.

Animals. APOE*3-Leiden transgenic mouse strains were generated by microinjecting a
twenty-seven kilobase genomic DNA construct containing the human APOE*3-Leiden gene, the
APOC1 gene, and a regulatory element termed the hepatic control region that resides between
APOC1 and APOE*3 into male pronuclei of fertilized mouse eggs. The source of eggs was
superovulated (C57Bl/6J x CBA/J) F1 females. Transgenic founder mice were further bred with
C57Bl/6J mice to establish transgenic strains. Transgenic and non-transgenic littermates of
F21-F22 generations were used in these experiments. All mice were fed a normal chow diet
(SRM-A, Hope Farms, Woerden, The Netherlands) and sacrificed at nine weeks, at which time plasma
and tissue samples were taken and frozen in liquid nitrogen. The samples from each individual were
then subdivided for separate protein and metabolite analyses.

Plasma lipoprotein profiling. Plasma from 9-week old mice that were kept on a regular
chow diet (SRM-A, Hope Farms, Woerden, The Netherlands) was fractionated by size exclusion
chromatography through a Super SW3000 TSKgel column (Tosoh Biosep, Tokyo) on an LC
Packings chromatography system (Dionex, Marlton, NJ). Total protein concentration for each
sample was determined by the Bradford assay, and 10 µL of whole plasma normalized to the
lowest concentration was injected and eluted isocratically in 20 mM Bis-Tris Propane, pH 6.9, 100
mM NaCl at 50 µL/minute. Base-resolved peaks corresponding to molecular weight ranges of
greater than 300 kD were collected as discrete fractions. Proteins were thermally
denatured and reduced in 100 mM ammonium bicarbonate, 5 mM calcium chloride and 10 mM
dithiothreitol at 75°C for 30 minutes, alkylated with 25 mM iodoacetamide at 75°C for 30
minutes, and then digested with 0.3% (w/w trypsin/protein) for 24 hours at 37°C.

Protein LC/MS analysis. Liquid chromatography-mass spectrometry (LC/MS) was
performed using an LCQ DecaXP (ThermoFinnigan, San Jose, CA) quadrupole ion trap mass
spectrometer system equipped with an electrospray ionization probe. The LC component
consisted of a Surveyor autosampler and quaternary gradient pump (ThermoFinnigan, San Jose,
CA). Samples were suspended in mobile phase and eluted through a Vydac low-TFA C18
column (150 × 1 mm, 5 µm) (GraceVydac, Hesperia, CA). The column was eluted at 50
µL/minute isocratically for two minutes with Solvent A (water/acetonitrile/acetic
acid/trifluoroacetic acid, 95:4.95:0.04:0.01, vol/vol/vol/vol) followed by a linear gradient over
43 minutes to 75% Solvent B (water/acetonitrile/acetic acid/trifluoroacetic acid,
20:79.95:0.04:0.01, vol/vol/vol/vol). The electrospray ionization voltage was set to 4.25 kV and
the heated transfer capillary to 200°C. Nitrogen sheath and auxiliary gas settings were 25 and 3
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units, respectively. For quantification of tryptic peptides, the scan cycle consisted.of a single full 
scanmass spectnim acquired over^ Data-dependent 
product ion mass spectra (MS/MS) were also acquired for peptide identification using the
TurboSEQUEST algorithm (ThermoFinnigan, San Jose, CA) in conjunction with NCBInr,
Swissprot and MSDB database searches using the MASCOT search algorithm (Matrix Science).

Metabolite analysis. The mouse plasma samples were prepared for global lipid and 
metabolite analysis by adding 0.6 mL of isqp^ followed by 

centrifLigation to predpitate and rcmove proteins. A 500 |tL aliquot of the supernatant was; 
concentt^ Toprepare 
^0 - 1 sMpl^fpr.LC/MSi 4Q0;{iL of water was addbdfo' 100^ ofthe-^OT^ of this; 

mb^ei^^ 

9 NMR analysis. NMR spefcfe w^ 
a Varian UNITY 4MMi^ 
of 293 

15 ; ^8:00&^;^5 degree jRulses wereus^ 

uMg the st^d^dVim An ejqjohential winddwf^ 

20* software. Tc^#tmn these listings all lines in ^ie spectra above a thi^hold cbrrc^pndirig to 

abi^rth^;^^ a data file: suitable for 

statistical >aii&y^ 

LC/MS analysis. An LSQ Cl^ Jose) was used to acquire 

plaana Upid and metabolite. component MS spectra. The LC componeht\consisted of a Waters 

25 717 series autpsampler and a 600 series single gradient-forming pump (Waters Corporation, 
:A^ordvMA), -Sample were injected onto m 3, 5jiM/3 mm x 100 

mm) protected by an R2 guard column (Chrompack). A 75 µL aliquot of mouse plasma extract
was injected twice in a random order. The random sequence was applied to prevent detrimental
effects of possible drift during analysis on the results obtained from statistical analysis. The

elution gradient was formed by using three mobile phases: (1) water/acetonitrile/ammonium
acetate (1 M/L)/formic acid, 93.9:5:1:0.1, vol/vol/vol/vol; (2) acetonitrile/isopropanol/
ammonium acetate (1 M/L)/formic acid, 68.9:30:1:0.1, vol/vol/vol/vol; and (3)
isopropanol/dichloromethane/ammonium acetate (1 M/L)/formic acid, 48.9:50:1:0.1,
vol/vol/vol/vol. The samples were fractionated at 0.7 mL/minute by a four-step gradient: (1)
over 15 minutes going from 30% to 95% buffer B; (2) a 20 minute gradient from 95% to 35% B
and 60% C with a 5 minute hold at this step; (3) a rapid one minute gradient of 35% B and 60% C
going to 95% and 0%, respectively; and (4) 95% buffer B going back to 30% over a 5 minute period.
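
The four-step gradient can be written down as a small lookup table, which makes the solvent composition at any time point easy to check. The following is a minimal sketch, not instrument control code. It assumes that the 5 minute hold follows the 20 minute gradient, that each step is linear, and that the balance of the mobile phase is phase A; the names SEGMENTS and composition are hypothetical.

```python
# Each segment: (start_min, end_min, (%B, %C) at start, (%B, %C) at end).
# Step boundaries follow the text; the hold placement is an assumption.
SEGMENTS = [
    (0, 15, (30, 0), (95, 0)),    # step 1: 30% -> 95% B
    (15, 35, (95, 0), (35, 60)),  # step 2: to 35% B and 60% C
    (35, 40, (35, 60), (35, 60)), # 5 minute hold
    (40, 41, (35, 60), (95, 0)),  # step 3: rapid return to 95% B
    (41, 46, (95, 0), (30, 0)),   # step 4: back to 30% B
]


def composition(t):
    """Return the percentage of phases A, B and C at time t (minutes)."""
    for start, end, (b0, c0), (b1, c1) in SEGMENTS:
        if start <= t <= end:
            frac = (t - start) / (end - start)
            b = b0 + frac * (b1 - b0)
            c = c0 + frac * (c1 - c0)
            return {"A": 100 - b - c, "B": b, "C": c}
    raise ValueError("time outside the 46 minute program")


start_mix = composition(0)
mid_step1 = composition(7.5)
hold_mix = composition(37)
```

Under these assumptions the program runs for 46 minutes in total, starting and ending at 70% A and 30% B.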

The electrospray ionization voltage was set to 4.0 kV and the heated transfer capillary to 
250°C. Nitrogen sheath and auxiliary gas settings were 70 and 15 units, respectively. For 
quantification of metabolites, the scan cycle consisted of a single full scan (1 s/scan) mass 
Spectrum acquired over ^2004700 in the positive ion mode. . — 

Data pre-processing NMR. The NMR spectra were aligned manually with the WINLIN
statistical software package (TNO Pharma, Zeist, The Netherlands).

Data pre-processing LC/MS. The LC/MS data files were converted to NetCDF format
using Xcalibur software (ThermoFinnigan). The converted files were evaluated with IMPRESS
post acquisition noise reduction and normalization software (TNO Pharma, Zeist, The
Netherlands) to obtain a fingerprint spectrum for each of the LC/MS files. The program
evaluates each mass trace for its chromatographic quality by assessing its information content.
This is performed after smoothing to remove spikes, by calculating for each mass the
entropy of the trace according to Equation 12. Taking the reciprocal value of H and scaling all
results to the largest value gives each mass trace a scaled chromatographic quality, or IQ.
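
The entropy-based quality score can be illustrated with a short sketch. Equation 12 is not reproduced in this section, so the code below assumes the Shannon entropy of the trace after normalizing its intensities to sum to one. The function names and the example traces are hypothetical, and this is not the IMPRESS implementation.

```python
import math


def trace_entropy(intensities):
    """Shannon entropy of one mass trace, treating intensities as a distribution."""
    total = sum(intensities)
    if total <= 0:
        return math.inf  # an empty trace carries no usable signal
    probs = [x / total for x in intensities if x > 0]
    return -sum(p * math.log(p) for p in probs)


def scaled_quality(traces):
    """Reciprocal entropy per m/z, scaled so the largest value is 1.0."""
    floor = 1e-12  # guards against division by zero for a single-point trace
    reciprocal = {mz: 1.0 / max(trace_entropy(t), floor) for mz, t in traces.items()}
    largest = max(reciprocal.values())
    if largest == 0:
        return {mz: 0.0 for mz in reciprocal}
    return {mz: r / largest for mz, r in reciprocal.items()}


traces = {
    400.0: [1.0, 1.0, 1.0, 1.0],  # flat, noise-like trace: high entropy
    401.0: [0.0, 1.0, 8.0, 1.0],  # one chromatographic peak: low entropy
}
iq = scaled_quality(traces)
```

In this toy input the peak-shaped trace receives the top score of 1.0, while the flat trace scores lower, which is the behavior the text describes for separating chromatographic signal from noise.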

PCA and PC-DA analysis. Principal component analysis (PCA) and discriminant analysis
(PC-DA) were applied to the fingerprint spectra of the aligned plasma NMR spectra and IMPRESS
preprocessed LC/MS spectra. This was done using WINLIN statistical software (TNO Pharma,
Zeist, The Netherlands).

Differential metabolic NMR analysis. To evaluate the pattern recognition and
clustering methods for metabolite analysis, a dual approach was used, where NMR was utilized
as the initial screening method followed by LC/MS, which has been established as a benchmark
analytical method for metabolome profiling in a variety of biological systems. [Raamsdonk et
al., Nature Biotech. 19, 45 (2001); Nicholson et al., Xenobiotica 29, 1181 (1999); Fiehn et al.,
Anal. Chem. 72, 3573 (2000).] To facilitate NMR data processing, the WINLIN software
package was applied to cluster and estimate the degree of variance between the wild type and
transgenic data sets. Sufficient differences, based on the preliminary NMR screen, have
emerged to warrant further detailed analysis using MS and MS/MS.

Whole plasma samples from 20 mice (n=10 for each group) were used for global
metabolite NMR analysis. For a typical 400 MHz 1H NMR, 750 µL of deproteinated sample in
MeOD were used to generate triplicate spectra, which are illustrated in Figure 36, for both the
wildtype mouse plasma sample (WT) and the Leiden mouse plasma sample (TG). After
referencing to the -CH3 signal of MeOD (δ = 3.30), line listings were prepared using the standard
Varian NMR software. To obtain these listings, all lines in the spectra above a threshold
corresponding to about three times the signal-to-noise ratio were collected and converted to a
data file format suitable for statistical analysis applications. The intent for using NMR
fingerprmtmg fbrMtM ahalya^ 
specmccompomds,.bmtoe^blishwhe^rtoe 

warrant a more detailed analysis. Close examination of the NMR data revealed small variations
in the resonance position of comparable lines. Variations in the positions of lines are due to the
relative concentration of the compounds in the samples and to instrument instabilities, such as
the temperature and the homogeneity of the magnetic field, which were corrected for manually.
Spectra processed in this manner were imported into the WINLIN statistical analysis tool for
principal component discriminant analysis (PC-DA) clustering.

Figure 37 illustrates a PC-DA score plot showing clustering of NMR data for the Leiden
mouse, represented by triangles, and the control mouse, represented by circles. WINLIN allows
graphical clustering of results after the data are normalized and subjected to principal component
analysis (PCA). Each point within the cluster is spatially positioned to represent one of the
triplicate sets of the preprocessed spectra. Concentration intensities from each of the triplicate
spectra were-used to c^^ Uhp kjamd^. component 

analysis is the extraction of eigenvectors from the variance/covariance matrix to obtain a number
of orthogonal sets of new variables, called principal components, that are optimized in their
ability to explain a maximum amount of variance in the original data. In highly correlated data,
a few of the top ranking principal components will be sufficient to reproduce the significant
variance in the original data set. PCA was applied to reduce the number of features needed to
investigate the partial linear fit (PLF) aligned NMR spectra of the control and APOE*3-Leiden
mice. Projections of the samples onto the first fifteen principal component axes were then used
as a starting point for linear discriminant analysis.

Factor spectra were used to correlate the position of clusters in the score plots to the
original features in the spectra by a graphical rotation of the loading vectors. [Windig et al.,
Anal. Chem. 56, 2297 (1984).] The difference factor spectrum plot, shown in Figure 38, is
characterized by a number of lines representing various metabolic components defined by a
range of contribution factors, specifically, ion m/z's that facilitated clustering of transgenic and
control mouse populations. The height of the lines above and below the axis of the plot is
directly related to the amplitude of the contribution to the overall variance, where the factors
extending below the axis correspond to higher spectral intensities in the transgenic animals.
Since PC-DA separates clusters in a single unique direction, lines projecting below the central
axis represent NMR spectral pattern constituents of higher intensity in the plasma of transgenic
mice. The lines extending above the central axis symbolize factors present at higher absolute
concentrations relative to the control group.

Factor spectra prepared in directions of maximum separation of the two categories were
used to give insight into the type of metabolites responsible for the separation of the observed
categories. Preliminary results based on the PC-DA loading plots point to the δ 3.8 ppm - δ 4.2
ppm region and the lipid region (δ 1.2 ppm - δ 0.8 ppm) as the primary contributors to
quantitative variance between Leiden and control samples.

The limitations of NMR spectroscopy result from the low inherent sensitivity of the
technique and from the high complexity and information content of NMR spectra. The
sensitivity of the technique is also affected by the minimum threshold concentrations of
compounds being detected. Regardless of its limitations, it is clear that NMR based metabolome
profiling coupled to pattern recognition technology is a powerful analytical approach for
integration of metabolic data into a comprehensive systems-level analysis. In this study,
however, the purpose of the NMR screen was not to identify specific molecules, but rather to use
the method to determine whether a qualitative degree of differentiation between sample
populations exists.

Simultaneous analysis of metabolic and protein components yields expected and
novel patterns. Metabolite extracts from plasma of transgenic (n=4) and control (n=4) mice
were prepared by the isopropanol precipitation method. Upon addition of 400 µL of water to
100 µL of extract, the samples were subjected to LC/MS analysis. Figure 39 depicts TICs that
were collected using single scan mode over the 400-1700 m/z mass range. To apply statistical
analysis to the LC/MS spectra, the raw data files were first converted to NetCDF format and
processed using IMPRESS noise reduction and normalization software. The program evaluates
each mass trace for its chromatographic quality by assessing its information content. This is
performed after smoothing to remove spikes, by calculating the entropy for each m/z of the
trace according to Equation 12. Mass intensities normalized by IMPRESS are assigned a scaled
chromatographic quality number, or IQ. To perform principal component analysis, the IQ-based
chromatograms in Figure 39 were imported into WINLIN, and discriminant analysis
separation was obtained based on two initial principal component vectors.

The proteomic whole plasma analysis was biased towards fractions containing
lipoprotein complexes. This was in line with expectations that most statistically relevant
changes associated with the Leiden mutation will occur in this class of proteins, based on the
transgenic model selected. Whole plasma samples from the transgenic (n=4) and control (n=4)
animals were fractionated by analytical size exclusion chromatography, and fractions
corresponding to high molecular weight plasma protein components were isolated as described in
the experimental protocol. Two major early peaks eluted at 23 minutes and 27 minutes,
corresponding to VLDL (fraction 1) and HDL (fraction 2),
respectively, were used for all subsequent manipulations. Proteins contained in fractions 1 and 2
were treated with trypsin to generate proteolytic peptides.

TICs of the VLDL fractions from the MS analysis are shown in Figure 40 for the
wildtype mouse (WT) and the Leiden mouse (TG). MS/MS spectra collected for all eight
representative samples were analyzed by TurboSEQUEST to generate hits against NCBI
nonredundant, human and mouse databases. The identities of these initial hits were further
verified using the MASCOT de novo sequencing and database search tool. The threshold for
assigning protein identities was based on the minimal sequence coverage set at 20% of total
residue count. The protein MS data were clustered in a way similar to the metabolic component
by generating IQ value spectra followed by discriminant analysis.

To observe quantitative relationships between metabolic and protein components of
plasma, an assembly of concatenated heterogeneous data sets was used. Original individual data
sets were integrated separately and IMPRESS quality m/z values from these sets were summed
and subjected to the statistical clustering analysis. The resulting score plot, which is illustrated
in Figure 41, shows PC-DA clusters for the wild type (WT) and transgenic (TG) animals
generated based on two principal components rotated to achieve maximum separation in D1.
Each point represents a linear combination of metabolite and protein variance factors (60 % of
original data set) for the individual animals.

Filtered m/z intensities from metabolite and peptide spectra were organized in a linear
fashion in the factor plot, shown in Figure 42. Linear distribution along the central axis
represents protein and metabolite components with calculated bi-directional contributions to
variance between the control and transgenic groups. Main positively contributing factors are
seen projecting above the nominal cut-off weight of 50. Negative contributors to the overall
variance project below the -50 set boundary.

By adding nominal values of 1601 and 3401 to each m/z value in the second protein and
the metabolic components, respectively, heterogeneous experimental data was analyzed in
parallel, as shown in Figure 42. Significant contribution intensities were scored based on the
factor plot specific threshold parameter, which was set to 50 in this instance. The masses that
were found to be major differentiators between the WT and TG data sets were extracted and
identified by LC/MS/MS. The combination intensities (raw data and IQ scores) of
differentiating factors were measured directly in the LC/MS chromatograms for statistical
significance (P<0.05) and calculation of fold change.
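
The offset bookkeeping used to place heterogeneous data sets on one axis can be sketched as follows. This is a minimal illustration, not the WINLIN procedure. Only the offsets (1601 and 3401) and the cut-off of 50 come from the text; the data set labels, function names and example weights are hypothetical, and the sketch assumes that each data set's m/z range fits below the next offset.

```python
# Offsets taken from the text; the data set labels are hypothetical names.
OFFSETS = {"protein_1": 0, "protein_2": 1601, "metabolite": 3401}
CUTOFF = 50  # factor plot threshold quoted in the text


def concatenate(datasets):
    """Place every m/z value on one shared axis by adding its data set offset."""
    combined = {}
    for name, spectrum in datasets.items():
        for mz, value in spectrum.items():
            combined[mz + OFFSETS[name]] = value
    return combined


def locate(position):
    """Map a shared-axis position back to (data set, original m/z)."""
    name = max((n for n, off in OFFSETS.items() if off <= position),
               key=lambda n: OFFSETS[n])
    return name, position - OFFSETS[name]


def differentiators(weights):
    """Return shared-axis positions whose factor weight lies beyond +/-CUTOFF."""
    return sorted(pos for pos, w in weights.items() if abs(w) > CUTOFF)


# Made-up factor weights keyed by original m/z within each data set.
example = {
    "protein_1": {500: 12.0},
    "protein_2": {500: -75.0},
    "metabolite": {496: 64.0},
}
axis = concatenate(example)
hits = differentiators(axis)
```

Here the two entries with m/z 500 no longer collide after shifting, and locate recovers which data set a flagged position came from before it is taken forward for identification.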

The results point to a composite profile that corroborates previous findings with respect
to lipoprotein and lipid abnormalities associated with the APOE*3-Leiden phenotype.
[Mensenkamp et al., J. Hepatol. 33, 189 (2000); van den Maagdenberg et al., J. Biol. Chem. 268,
10540 (1993); Willems van Dijk et al., Arterioscler. Thromb. Vasc. Biol. 19, 2945 (1999); and
Mensenkamp et al., J. Biol. Chem. 274, 35711 (1999).] Specifically, at the protein level we
were able to show that the human APOE*3-Leiden allelic variant is expressed and functionally active
in the transgenic animals, as evidenced by its incorporation into VLDL (protein component 1 in
Figure 42) and LDL/HDL (protein component 2 in Figure 42) fractions of plasma derived
lipoproteins. Alternatively, murine ApoAI has been found to be twofold less abundant in the
plasma of transgenic mice, indicating a lower degree of incorporation of the apolipoprotein into the
LDL/HDL complexes in these animals.

Although the underlying processes governing HDL metabolism have not been fully
defined, HDL levels in plasma have been shown to have an inverse relationship with atherosclerosis
susceptibility. [Callow et al., Genome Res. 10, 2022 (2000); and Glass and Witztum.] A
number of different mechanisms can control plasma HDL. The most prominent factors identified in
mouse models that contribute to lowering plasma HDL include defects in apoAI, apoE, and
phospholipid transfer protein (PLTP) and the overexpression of cholesteryl ester transfer protein
(CETP) or scavenger receptor SRB. [Callow et al.; Williamson et al., Proc. Natl. Acad. Sci.
USA 89, 7134 (1992); and Wang et al., J. Biol. Chem. 273, 32920 (1998).] Assuming that the
Leiden mutation is functionally analogous to a defective APOE allele, it is highly likely that, in
the context of the Leiden model, the lower HDL levels are at least partially the result of the
ApoE*3 transgene function. One possibility for the decrease in total endogenous ApoAI is the
stoichiometric imbalance due to constitutive overexpression of the hApoE3 and its preferential
recruitment for LDL/HDL assembly.

This study demonstrates the utility of a multilevel approach for characterization of a
highly complex system. By generating high content analytical output and comparing integrated
principal component factors derived from composite data sets, rapid elucidation of identities and
the relative abundances of major lipoprotein metabolism mediators that define the ApoE*3-Leiden
genotype was possible. Solely based on a biofluid analysis, this effort represents the first
attempt to apply systems biology rationale in a way that unites quantitative proteomic and
metabolome data to explain disease. In the future, it will be possible to enhance this approach by
including the genomic component in the form of differential transcription analysis of multiple
tissues and make it truly global with respect to understanding pleiotropic effects of gene
perturbations.

Example 5. Systems biology approach: Metabolic Disease Study

Summary. The overall goal of this example is to demonstrate molecular analysis and
data integration capabilities according to the invention. The general area of medical interest was
metabolic disease, and the materials to be analyzed were serum samples from animals
(rodent and non-human primate) and from human subjects. A subset of each group of rodents
(diseased and control) was drug treated. During the initial phase of the project (Phase I), the
tester was aware that there were three sample sources (rodent, non-human primate, and human)
but was blinded to the details of the grouping of the samples within each species.

The specific objectives of the study were as follows. 

Phase I

■ to undertake metabolite and protein analyses of blinded serum samples from animal and 
human subjects; and 

■ to group the samples based on the serum metabolite and protein profiles.

Phase II

■ after unblinding, to compare the grouping of the samples as determined with the actual 
sample groups; 

■ to define, for each of the sample types, molecular components (biomarkers) that can be 
used to differentiate one group of samples from another; 

* to construct correlation networks for tiie ^ in order to gain insight into the 
biochemical processes underlying the disease or drug treated phenotypes; and

■ to determine whether molecular components which differentiate diseased rodents from
control rodents are similar to those which differentiate diseased human patients from
control human subjects.

Blinded analyses of the metabolite and protein profiles for the rat serum samples revealed
four clearly distinct groups that, upon unblinding, corresponded exactly to the actual groups of
samples (Diseased + vehicle, Diseased + drug, Control + vehicle, Control + drug). Blinded
analyses of the profiles for the non-human primate samples revealed two distinct groups that,
upon unblinding, corresponded exactly to the diseased and control groups. For the human
samples, blinded analyses of the metabolite and protein profiles revealed different numbers of
groups (4 or 2), depending upon the analytical platform employed. Analysis based only on lipid
profiles revealed two groups that, upon unblinding, corresponded with $6% accuracy to the 
diseased patients and with 89% accuracy to the control subjects. 

A large number of metabolites and proteins were identified that differentiated between
the groups of animal and human serum samples. The relative levels of these biomarkers in the
samples provided insight into the biochemical processes underlying the disease or drug response.
One of the notable findings was the effect in the diseased rodents of the drug treatment on serum
protein levels. A second, distinct finding was the almost identical widespread changes in the
levels of over 150 serum lipids in both the diseased rodents and the diseased patients relative to
the levels in the corresponding control subjects. As a validation of the rodent model as a model
of the human disease, the tester was also able to use the set of serum lipid biomarkers found to
correctly classify diseased versus control rodents to distinguish with good precision the diseased
patients from the control human subjects.

Introduction. The overall goal of this example was to provide a basis to assess
integrated platforms of proteomics, metabolomics and informatics technologies as applied to
comparative studies of pre-clinical and clinical serum samples. Serum samples were provided
from a drug treatment study in a rodent model of metabolic disease, a comparative study of
metabolic disease in human subjects, and a study of a related condition in non-human primates.
The project was divided into two phases. In Phase I, the tester was blinded with respect to
sample information and performed comparative quantitative profiling of metabolites and proteins
using a combination of NMR and MS techniques. Informatics methods such as unsupervised
clustering analyses were applied to the data to determine if the experimental groups could be
accurately discriminated. At the conclusion of Phase I, the data was unblinded, and it was
revealed that the methods used had determined groups with a high degree of accuracy. The
emphasis of the second phase was identification of metabolites and proteins that contributed to
10 _the di ffer wtiuh the rodent drug treatmeht/dlgease study * 

asweUMadefcnriih^o^^ 
rone another. ^ aSdMbife c^ 
their rodent-model counterparts wwi ejqrtorei^ 

. the humm disease animal model. This Ex^ple Inghlights only certmn results in order 

15 to?erempH|y^ 

Sample information. In Phase I of the study, the tester was blinded with respect to
whether the samples were from unaffected or affected (diseased and/or drug-treated)
subjects. Unblinding of the sample information was done prior to Phase II. The experimental
i^oupsandh^ 

A. Drug treatment study in a rodent model of metabolic disease: A total of 32 serum
samples (600 µL each) from a drug treatment study where a therapeutic drug was administered
to diseased rodents and non-diseased rodents (control) were subdivided as follows.
n = 8 control treated with vehicle
n = 8 control treated with drug
n = 8 diseased treated with vehicle
n = 8 diseased treated with drug

B. Comparative study of metabolic disease in human subjects: A total of 42 serum samples
(300 - 400 µL per sample) from individuals diagnosed with metabolic disease and controls
were subdivided as follows.
n = 14 subjects diagnosed with metabolic disease
n = 28 controls

C. Disease study of non-human primates: A total of 24 serum samples (300 - 850 µL per
sample) from non-human primates were profiled.
n = 13 normal non-human primates (monkeys)
n = 12 diseased non-human primates (monkeys)
Methods utilized - Analytical profiling. The approach in the Example to differential
proteomics and metabolomics employs several distinct analytical methods that enable the
quantitative profiling of a wide range of molecular components. These methods utilize either
NMR or MS as analytical endpoints. Profiling platforms have been optimized taking into
account robustness, reproducibility, sensitivity, and dynamic range and are designed to survey
molecules that may span orders of magnitude in abundance as well as a range of biochemical
classes. Each platform has the capacity to profile many components (hundreds to thousands)
within a single analysis, and software tools were used to facilitate the extraction of quantitative

infoimation for integr^oninto computational and Moim Methods applied in this 

study are listed below:

1. Protein LC/MS: allows profiling and identification of peptides and proteins.

2. CPMG NMR: enhanced NMR measurement of low molecular weight metabolites.

3. Diffusion-edited NMR: enhanced measurement of lipoprotein-associated metabolites.

4. Lipid LC/MS: optimized for profiling of lipids and non-polar metabolites.

Methods utilized - Data processing. The resultant NMR spectrum or LC/MS
chromatogram obtained from a profiling experiment may contain many hundreds of peaks that
represent the relative abundance of hundreds of molecules. Data processing software tools are
used to enable the extraction of this information from each data file as well as the comparison of
measured peak intensities across the sample set. As described above, typically, data processing
steps include peak detection and measurement of relative intensities (peak integration), an
"alignment" step to compensate for minor differences in peak position that might occur from one
sample analysis to another (i.e., small differences in NMR chemical shift or LC/MS retention
time for a particular peak), and assignment of an identifier (or index number) to each peak so
that it might be compared across samples.
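
The alignment and indexing steps described above can be illustrated with a minimal sketch. It is not the software used in the study: the function name `align_peaks`, the greedy grouping rule, and the tolerance value are all assumptions made for illustration. Peaks from different samples whose positions (chemical shift or retention time) fall within a tolerance of a running group center receive the same index number.

```python
def align_peaks(samples, tol):
    """Group peaks across samples and assign a shared index to each group.

    samples: dict sample_id -> list of (position, intensity) tuples.
    tol: maximum distance from the running group center (hypothetical value).
    Returns (centers, table): centers[i] is the mean position of peak index i,
    table[sample_id][i] is that sample's intensity, or None if not detected.
    """
    # Pool every peak from every sample and sort by position.
    pooled = sorted(
        (pos, sid, inten)
        for sid, peaks in samples.items()
        for pos, inten in peaks
    )
    groups = []  # each group is a list of (position, sample_id, intensity)
    for pos, sid, inten in pooled:
        if groups:
            last = groups[-1]
            center = sum(p for p, _, _ in last) / len(last)
            already_in_group = {s for _, s, _ in last}
            # Join the current group only if close enough and this sample
            # has not already contributed a peak to the group.
            if abs(pos - center) <= tol and sid not in already_in_group:
                last.append((pos, sid, inten))
                continue
        groups.append([(pos, sid, inten)])

    centers = [sum(p for p, _, _ in g) / len(g) for g in groups]
    table = {sid: [None] * len(groups) for sid in samples}
    for idx, group in enumerate(groups):
        for _, sid, inten in group:
            table[sid][idx] = inten
    return centers, table


example = {
    "s1": [(1.00, 10.0), (2.50, 5.0)],
    "s2": [(1.02, 12.0), (2.48, 4.0), (4.00, 7.0)],
}
centers, table = align_peaks(example, tol=0.05)
```

With the example input, the two samples share peak indices 0 and 1, and index 2 is present only in `s2`, so its entry for `s1` is `None`.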

Methods utilized - Data analysis. The data were analyzed using several different
statistical approaches: (1) unsupervised clustering of samples (including COSA hierarchical
clustering), (2) univariate statistics to determine peaks that are different between groups of
samples, and (3) correlation network analysis to identify correlations between individual
components of metabolite and protein sets for all samples. In addition, some preliminary data
analyses with a support vector machine (SVM) classifier for the purpose of classification were
undertaken. Figure 43 is a schematic representation of the data analysis workflow. Elements of
the data analysis process are listed below in the order they are performed.

1. Data Normalization: adjusting for platform-specific variation within the dataset.

2. Application of exploratory unsupervised clustering methods:

- COSA

- Principal Components Analysis

- K-Means Clustering (human samples only)

- Neural network (human samples only).

3. Peak selection for identification: determine significant, discriminating peaks by
means of univariate statistical methods (pairwise two-tailed t-tests) and prioritize for
identification.

4. Correlation Networks: determine correlations between individual components.

5. Data Visualization: use software tools to incorporate database information with the
experimentally generated data.

Results and discussion for the rodent model of metabolic disease regarding analyses
of serum samples - Unsupervised clustering. Initial analyses focused on unsupervised
clustering of data collected from blinded rodent serum samples. Unsupervised clustering is a
statistical method that attempts to group samples with no foreknowledge of sample classification
or the number of distinct groups in the collection of samples. An outline of the workflow is
provided in Figure 44. In general, multiple data sets from multiple analytical platforms were
normalized and clustered. To the extent an individual data set does not correctly or distinctly
cluster, the multiple data sets can be concatenated (i.e., combined and/or correlated) for further
clustering analysis. In this Example, although the individual data sets showed
appropriate clustering, the data sets were concatenated and/or integrated and/or corrected to
obtain an even more robust analysis. The concatenated data were normalized and clustered, and
the results were recorded as a profile of a biological system.

Data collected from all individual platforms resulted in clustering of blinded serum
samples into distinct groups, the only difference between the platforms being the number of
clusters formed. Clustering into four groups was observed with both the protein and lipid
platforms. The four groups that were ultimately identified consisted of samples 1-8, 9-16,
17-24, and 25-32.

The clustering of the LC/MS proteomic data (i.e., a single analytical platform) is
illustrated in Figure 44A. Figure 44A is an example of the COSA clustering analysis of rodent
serum proteomic LC/MS analysis, after data alignment and normalization. In this analysis, the
2,977 peaks that appeared in at least 28/32 rodents (>87% of the samples) were used for
clustering. Data obtained from the other metabolite platforms, CPMG NMR and Diffusion-edited
NMR, clustered the samples into fewer groups but the divisions were consistent with the
groups found during the lipid and protein analyses.
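
The presence criterion used for peak selection (a peak must appear in at least 28 of the 32 rodents) can be sketched as a simple filter. The function name `peaks_present_in` and the table layout, in which `None` marks a peak that was not detected in a sample, are assumptions made for illustration.

```python
def peaks_present_in(table, min_samples):
    """Return the indices of peaks detected in at least `min_samples` samples.

    table: dict sample_id -> list of intensities, one entry per peak index,
    with None marking a peak that was not detected in that sample.
    """
    n_peaks = len(next(iter(table.values())))
    return [
        i
        for i in range(n_peaks)
        if sum(row[i] is not None for row in table.values()) >= min_samples
    ]


# Toy table with three samples and three peak indices.
toy = {
    "r1": [1.0, None, 3.0],
    "r2": [2.0, None, None],
    "r3": [1.5, 0.7, 2.0],
}
kept = peaks_present_in(toy, min_samples=2)
```

For the rodent data the call would use `min_samples=28` on a 32-sample table, which corresponds to a presence fraction of 28/32 = 87.5%.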

Figure 44B shows a more robust representation of the four groups (as described above).
Figure 44B is the result of COSA clustering applied to combined data from all platforms.
Clustering using CPMG NMR data only revealed three clusters while using DE NMR data only
revealed two clusters (not shown). Combining data from proteomics, lipid LC/MS, CPMG
NMR and DE NMR (4|51 variables total) yielded four clear groups. The groupings were
consistent with the results of the individual treatments of the proteomics data and the lipid
profiling data.

Unblinding the samples revealed that groups delimited using these methods corresponded
exactly to the different rodent cohorts as summarized in Table 1 below.

Table 1. Sample Identification Provided After Cluster Analysis

Sample ID    Cohort

1-8          diseased rodents treated with vehicle (DISveh)
9-16         diseased rodents treated with drug (DISdrug)
17-24        control rodents treated with vehicle (CONveh)
25-32        control rodents treated with drug (CONdrug)

Results and discussion for the rodent model of metabolic disease regarding analyses
of serum samples - Metabolite and peptide peak identification. Univariate statistical
methods were applied to the peaks profiled in Phase I to select, for subsequent identification,
those peaks which exhibited differing abundances among the four groups of rodents. The
primary statistical analysis consisted of a pairwise t-test with a significance level α = 0.05. The
workflow for this analysis is outlined in Figure 45. In general, multiple data sets from multiple
analytical platforms were concatenated, integrated, and correlated, and then normalized.
Statistically different components between the disease and control groups were extracted, and the
difference was quantified. Then, the system was perturbed by administering a drug to the
diseased group, and a similar analysis was undertaken to determine the differences between the
treated and control groups. Finally, all of the components identified were compared between the
two experiments to obtain a profile of the biological system.
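
A pairwise t-test screen of this kind can be sketched as follows. The source does not state which form of the t-test was applied to the rodent data, so the equal-variance (pooled) statistic below is an assumption, as are the function names. Because the Python standard library has no t-distribution, significance is decided by comparing |t| with a tabulated two-tailed critical value; 2.145 is the value for α = 0.05 and 14 degrees of freedom (two groups of eight animals).

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances in both groups.

    Raises ZeroDivisionError if both groups have zero variance.
    """
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))


def select_peaks(peaks, t_crit):
    """Keep peaks whose |t| exceeds the two-tailed critical value.

    peaks: dict peak_index -> (values_group_a, values_group_b).
    Returns dict peak_index -> t statistic for the selected peaks.
    """
    selected = {}
    for idx, (a, b) in peaks.items():
        t = pooled_t(a, b)
        if abs(t) > t_crit:
            selected[idx] = t
    return selected


# Tabulated two-tailed critical value, alpha = 0.05, df = 8 + 8 - 2 = 14.
T_CRIT_DF14 = 2.145
```

A peak list screened this way would then be passed on for identification, as described above.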
A representative excerpt showing differences observed among metabolites and peptides
is shown in Figure 45A. (These components may also be observed in the correlation network
analysis (Figure 46) where they display correlations among themselves as well as with other
identified peptides and metabolites.) By viewing the data in this representation, one can see, for
example, that levels of two serum proteins (Protein 1 and Protein 2) were found to be
differentially and oppositely regulated between diseased and control rodents (vehicle treated),
and that treatment with drug essentially lowers diseased Protein 1 levels to that of the control
animals while increasing Protein 2 to levels approximately two-fold higher than the controls.
Another interesting observation is the differential effect of drug treatment on select lipid levels.

Note that, for each molecular component, the results are presented in the following order:

1. diseased + vehicle / control + vehicle: Effect of disease.

2. diseased + drug / diseased + vehicle: Effect of drug treatment on disease state.

3. diseased + drug / control + drug: Comparison of drug-treated disease with drug-treated
control.

4. diseased + drug / control + vehicle: Comparison of drug-treated disease with
untreated control.

5. control + drug / control + vehicle: "Side effect" of drug.

This is the order of presentation for all analyses of the rodent serum samples throughout the
Example for the instances where all five comparisons have been made.
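
The five comparisons listed above can be computed for one molecular component as ratios of group means. The sketch below is illustrative only: the cohort labels are taken from Table 1, while the function name `group_ratios` and the toy intensities are assumptions.

```python
from statistics import mean

# (numerator cohort, denominator cohort), in the order of presentation above.
COMPARISONS = [
    ("DISveh", "CONveh"),    # 1. effect of disease
    ("DISdrug", "DISveh"),   # 2. effect of drug treatment on disease state
    ("DISdrug", "CONdrug"),  # 3. drug-treated disease vs. drug-treated control
    ("DISdrug", "CONveh"),   # 4. drug-treated disease vs. untreated control
    ("CONdrug", "CONveh"),   # 5. "side effect" of drug
]


def group_ratios(intensities):
    """intensities: dict cohort label -> list of normalized peak intensities.

    Returns the five ratios of group means in the order of COMPARISONS.
    """
    means = {cohort: mean(values) for cohort, values in intensities.items()}
    return [means[num] / means[den] for num, den in COMPARISONS]


# Toy values for a single component (not data from the study).
toy = {
    "DISveh": [2.0, 2.0],
    "DISdrug": [1.0, 1.0],
    "CONveh": [1.0, 1.0],
    "CONdrug": [0.5, 0.5],
}
ratios = group_ratios(toy)
```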

Results and discussion for the rodent model of metabolic disease regarding analyses
of serum samples - Correlation network analysis. In addition to changes in component
abundance levels between groups, the examination of correlations among
components is useful to reveal important relationships among the various components studied.
Such a correlation analysis is complementary to abundance level information, and often provides
information about the biochemical processes underlying the disease or drug response.

Figure 46 is a representative correlation network derived from the proteomic,
metabolomic and clinical chemistry data in the pairwise comparison of the eight diseased
drug-treated rodents and the eight diseased vehicle-treated rodents (drug effect on disease state).
As can be seen in the legend, the components (or 'nodes') of the network are the various proteins,
metabolites or clinical chemistries measured by the various platforms. All of the nodes in this
figure, and in figures similar to this one, are components which have: (i) been identified, and (ii)
exhibited a fold-change greater than ±15% with p < 0.05.

There are a number of independent levels of information displayed in this type of
correlation network. First, the particular shape of a node represents the platform that was used to
measure the component. For example, in Figure 46, the square shaped nodes are peptides which
have been measured and identified (i.e., sequenced and validated) by mass spectrometry.
Second, the shading of a given node reflects the abundance difference in the sera of the two
groups being compared; this is a normalized group mean difference. Third, the lines between
pairs of nodes represent correlations in which the Pearson coefficient is between 0.80 and 1.00,
or -0.80 to -1.00. Negative correlation values are presented as light lines, while positively
correlated components are connected visually by dark lines in the graphical representation.
Generally speaking, two components which are positively correlated reflect a statistically
significant mutual behavior characterized by a change in one component being concomitantly
related to a similar change in the second component, across all samples in the group. A trivial
example may be pairs of peptide components from the same protein which behave similarly, or
two NMR resonance components from the same molecule. Biochemically relevant correlations
may also be observed, such as between metabolites that are part of the same biosynthetic
pathway or between entities that are components of the same macromolecular structure. An
example of this type of correlation is shown in Figure 46, where the Protein 2 peptide is highly
positively correlated with a number of lipid components in the serum; this high degree of
correlation suggests that these lipids may share the same lipoprotein origin as Protein 2 in serum.
Negative correlations may, for example, arise between components that are part of the same
pathway, but where they might be separated by a point of enzyme inhibition or substrate
limitation. In addition, components that fall past committed biosynthetic branch points may
show negative correlations with one another.
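
The edge rule of such a network (a line is drawn between two nodes when the Pearson coefficient lies between 0.80 and 1.00 or between -0.80 and -1.00) can be sketched in a few lines. The function names and the toy profiles are assumptions; node shape, shading and the self-assembling layout are outside the scope of the sketch.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_edges(profiles, threshold=0.80):
    """profiles: dict component name -> abundance values across samples.

    Returns a list of (name_a, name_b, r, sign) for every pair of components
    whose |r| is at least the threshold.
    """
    edges = []
    for (name_a, xa), (name_b, xb) in combinations(profiles.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:
            sign = "positive" if r > 0 else "negative"
            edges.append((name_a, name_b, r, sign))
    return edges


# Toy profiles (not data from the study).
toy = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
    "D": [1.0, -1.0, -1.0, 1.0],
}
edges = correlation_edges(toy)
```

In the toy input, A and B are positively correlated, C is negatively correlated with both, and D is uncorrelated with the rest, so D contributes no edges.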

The overall topology of the structure is what is referred to as self-assembling and reflects
clusters of components which are highly inter-correlated. Those nodes which are close to one
another reflect a particularly high density of mutual correlation. The topology is generated in an
unsupervised and automated fashion.

By investigating such structures, a number of interesting observations become apparent.
For example, it is seen that Lipid 2 is higher in abundance upon treatment (the node is at
approximately 4 o'clock in the largest circular structure), and furthermore it is negatively
correlated with many other lipid components. It should be understood that this figure is
illustrative of the principles and techniques of the invention; it is one of many such correlations
that are possible.

Results and discussion for the rodent model of metabolic disease regarding analyses
of serum samples - Heat plot analysis. An alternate view of the correlation information for the
comparison of diseased drug-treated and diseased vehicle-treated groups is shown in Figure 47.
This "heat plot" shows an array of correlation coefficients calculated for each pairing of
identified metabolite and peptide peaks. The color of the off-diagonal spot for a pair of
component peaks corresponds to the sign of the correlation coefficient between the peaks (either
positive or negative), while the color intensity is proportional to the magnitude of the correlation.

Though complex, this visualization enables a rapid inspection of the complete array of
correlations. When the components are grouped according to analytical method as shown in
Figure 47, correlations between different component classes are apparent. For example, the
off-diagonal area that lines up with peptides of index numbers of 22-32 and lipids of index numbers
110-140 shows regions of both positive and high negative correlation. In this case, the
positively correlated peptides (22-26) are from Protein 1 while the lipids are triglycerides. Note
that fold-change information is not represented in Figure 47; the shade scale represents the
Pearson correlation coefficient.

Results and discussion for the rodent model of metabolic disease regarding analyses
of serum samples - Rodent protein ratios. Certain proteins play an integral role in lipid
metabolism. It is therefore not surprising that differences in the levels of peptides associated
with some of these proteins are found in the different sample cohorts examined as part of this
study. Figure 48 illustrates the differences in four such proteins, Protein A (Protein 1), Protein
B, Protein C and Protein D (Protein 2), represented as ratios between different groups. Six
tryptic peptides were observed from Protein A, one from Protein B, one from Protein C and two
from Protein D. The plot in Figure 48 shows ratios between groups based on the means of the
peak intensity values within each group (after normalization and scaling). It is apparent that
significant fold changes exist between the different groups. Particularly striking are the Protein
D ratio changes between diseased rodents treated with drug and diseased rodents treated with
vehicle as well as between the diseased rodents treated with vehicle and the control subgroup of
rodents treated with vehicle.

Results and discussion for the metabolic syndrome study regarding analyses of
human serum samples - Unsupervised clustering. Unsupervised clustering was applied to the
human data derived using all individual platforms: protein, lipid, and NMR. As mentioned
above for the rodent model of metabolic disease, this allows grouping of samples with no
foreknowledge of sample classification or the number of distinct groups. COSA analysis of the
peptide data grouped the samples into four weak clusters. Clustering using the NMR Global
metabolite data split the samples into two groups. Once the sample information was unblinded, it
was apparent that these groupings did not correspond to the diseased vs. control cohorts.

In contrast, COSA analysis of lipid data suggests two clusters (Figure 49). The COSA
distance clustering used 779 human LC/MS lipid peaks. These clusters correspond to the
diseased patients with 86% accuracy (12/14) and the control subjects with 89% accuracy (25/28).
Multivariate analysis indicated that lipids were the strongest discriminator between diseased and
control samples.
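
The reported accuracies follow from comparing cluster membership with the unblinded diagnoses. A minimal sketch of that bookkeeping is given below; the function name and the rule of labelling each cluster by its majority class are assumptions, and the example only reproduces the reported counts (12 of 14 and 25 of 28), not the actual sample assignments.

```python
from collections import Counter


def per_class_accuracy(cluster_ids, true_labels):
    """Label each cluster by its majority class, then score each class.

    Returns dict class label -> fraction of that class's samples that fall
    in a cluster carrying the same label.
    """
    members = {}
    for cluster, label in zip(cluster_ids, true_labels):
        members.setdefault(cluster, Counter())[label] += 1
    cluster_label = {c: cnt.most_common(1)[0][0] for c, cnt in members.items()}

    correct, total = Counter(), Counter()
    for cluster, label in zip(cluster_ids, true_labels):
        total[label] += 1
        correct[label] += cluster_label[cluster] == label
    return {label: correct[label] / total[label] for label in total}


# Hypothetical arrangement that matches the reported counts:
# 12 of 14 diseased in cluster 0, 25 of 28 controls in cluster 1.
labels = ["diseased"] * 14 + ["control"] * 28
clusters = [0] * 12 + [1] * 2 + [1] * 25 + [0] * 3
accuracy = per_class_accuracy(clusters, labels)
percent = {label: round(100 * value) for label, value in accuracy.items()}
```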

The lack of strong clustering in 2 out of the 3 platforms indicates that clustering is
dominated by other factors such as medications, gender, age or environment. Given these weak
clusters derived using COSA for some of the platforms, other clustering techniques, such as
K-Means and neural networks, were investigated using the same data set. These techniques gave
results similar to COSA, with the exception of a few samples at the boundaries between groups.

Results and discussion for the metabolic syndrome study regarding analyses of
human serum samples - Metabolite and peptide peak identification. As was seen in the
rodent study, potentially interesting peaks can be found by highlighting those that differ
significantly in level between sample types. For the purpose of this study, the human samples
were first divided into the two groups (14 disease patients and 28 control subjects). A
two-sample t-test was performed for each peak to test for mean differences between the two groups,
and this resulted in a list of peaks submitted for identification.

For the lipid platform, a subset of peaks that exhibited differences between diseased
patients and control subjects was identified using a reference database as well as targeted
MS/MS methods. In general, upon peak identification, it was found that the levels of certain
lipid molecules in diseased patients were significantly different from the levels of these lipids in
control subjects. Interestingly, as seen in the rodent/human comparison study below, many of
these lipid levels are also significantly different in diseased rodents compared to control rodents.

Additionally, a list of human proteins was identified as part of this study using the
"shotgun" tandem mass spectrometry (MS/MS) method. There was no overlap between the set
of peaks which were selected during the MS profiling stage for sequencing by shotgun MS/MS
and the set of peaks which exhibited statistically significant level differences between the two
groups of human serum samples.
Results and discussion for the comparison of rodent samples with human samples.

In this portion of the study, the objective was to compare the lipid components in the serum from
diseased vehicle-treated and control vehicle-treated rodents to the corresponding lipids in the
serum from diseased and control humans. No drug treatment groups were involved in these
analyses. The data from the LC/MS serum lipid platform were used, specifically the 571 LC/MS
peaks common to both species. Figure 50 shows the workflow for this analysis.

In this framework, two issues were addressed. The first issue concerned the accuracy in
clustering and classifying human samples based on rodent measurements, and the second issue
regarded a comparison, across the two species, of lipid abundance changes and correlations.

Results and discussion for the comparison of rodent samples with human samples -
Clustering and classification. Among the 571 peaks that were common to both species, in 366
there were significant mean changes between the two rodent groups (at a significance level of
0.05 and using two-tailed pairwise t-tests). As an exploratory step, this set of 366 peaks was
used to determine whether clustering would group the diseased
humans together with the diseased vehicle-treated rodents and the control humans together with
the control vehicle-treated rodents. The results of this analysis are shown in Figure 50A.
Specifically, the result of a COSA analysis of human serum samples, in which the data
used for classification consisted of 366 lipid peaks chosen from the diseased rodent model, is
shown. The figure reveals two main groups, corresponding to the diseased and control
samples: 27 of the 28 control humans and all 8 control rodents belong to one group, and 11 of
the 14 diseased humans belong to the other group. It is concluded from
this analysis that if the diagnosis of the humans was not known, it could be deduced with high
accuracy by inspecting the clusters formed in the two rodent groups.

For classification purposes a support vector machine (SVM) linear classifier was used in
which the 366 rodent lipid measurements served as the model building set and the corresponding
366 human lipid measurements as an independent test set. The percentage of human samples
correctly classified varied between 76% (32 of the 42 samples) and 93% (39 of the 42 samples)
as seen in Figure 51. Figure 51 shows the success rate of an SVM linear classifier as a function
of number of lipid peaks. In this analysis, the rodent data are used for model building, and the
success rate is the percentage of rodents correctly classified in a leave-one-out procedure. Also,
in this analysis, the human data are used as a test set, and the success rate is the percentage of
humans correctly classified by the rodent model. Further investigation of the classification and
peak reduction procedures may lead to the confirmation that the diseased rodent model is a good
model for metabolic disease in humans.

Results and discussion for the comparison of rodent samples with human samples -
Common components. A comparison of the 571 LC/MS lipid peaks that were common to both
species revealed that there were significant mean differences in both species between the
diseased and control groups (at a significance level of 0.05)
for 195 out of the 571 lipid LC/MS peaks. Of these 195 peaks, 185 exhibited the same trend in
both species (higher or lower serum abundance in diseased vs. control). In addition, a number of
correlations between pairs of lipid peaks were present both in the human and rodent samples,
using an absolute value of Pearson correlation coefficient greater than 0.7, indicating that not
only were the abundance differences conserved, but also that underlying mechanisms involved in
the regulation of those lipid levels may likely be conserved across species. An excerpt of the
results is summarized in Figure 52.
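
The two cross-species checks described above can be sketched as set operations. The function names, the dictionary layouts and the toy values are assumptions; the 0.70 threshold on the absolute Pearson coefficient is taken from the text.

```python
def same_trend(rodent_diff, human_diff):
    """Peaks whose diseased-minus-control difference has the same sign in both species.

    Both arguments map peak index -> (diseased mean - control mean).
    """
    shared = rodent_diff.keys() & human_diff.keys()
    return sorted(p for p in shared if rodent_diff[p] * human_diff[p] > 0)


def conserved_pairs(rodent_r, human_r, threshold=0.70):
    """Peak pairs whose |Pearson r| exceeds the threshold in both species.

    Both arguments map (peak_a, peak_b) -> Pearson correlation coefficient.
    """
    return sorted(
        pair
        for pair in rodent_r.keys() & human_r.keys()
        if abs(rodent_r[pair]) > threshold and abs(human_r[pair]) > threshold
    )


# Toy values (not data from the study).
rodent_diff = {1: 0.5, 2: -0.3, 3: 0.2}
human_diff = {1: 0.1, 2: 0.4, 3: 0.9}
rodent_r = {(1, 2): 0.9, (1, 3): -0.75, (2, 3): 0.2}
human_r = {(1, 2): 0.8, (1, 3): 0.5, (2, 3): 0.95}

trend_peaks = same_trend(rodent_diff, human_diff)
shared_edges = conserved_pairs(rodent_r, human_r)
```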

More specifically, Figure 52 shows a comparison of lipid abundance changes and
correlations across human and rodent species. In the figure, the large circles consist of elements,
each of which represents a different LC/MS lipid peak. The shading of the elements
corresponds to the relative abundance of the lipid in diseased vs. control samples. The relative
abundances are normalized group mean differences. There are 195 such elements, all
representing lipids with p < 0.05. The outer large circle represents the diseased rodent vs. control
rodent group comparison, while the inner concentric circle represents the diseased human vs.
control human group comparison. The lines connecting pairs of elements in the figure are
correlations, with an absolute value of the Pearson coefficient greater than 0.70, which are
present in both species.

Summary and conclusions. Metabolite and protein analyses of blinded serum samples
from animal and human subjects were performed which allowed grouping of the samples based
on their serum metabolite and protein profiles. Groups identified using clustering analysis
reflected with 100% accuracy the phenotypic categories of the animal subjects and with a high
degree of accuracy (>80%) the human subjects. Subsequent analyses identified many of the
molecular components that differentiate the subjects.

These independent measures are informative in themselves. Moreover, when linked
using correlation networks, one begins to see details of the biochemical processes that underlie
the disease or drug response. One of the more interesting results is that the molecular
components that differentiate the diseased rodents from the control rodents are very similar to
those that differentiate the diseased humans from the control subjects. The wealth of data
generated by this study illustrates the strengths of the Systems Biology approach utilizing an
integrated platform of proteomics, metabolomics and informatics technologies.
Nomenclature / Terms Used In This Example 

Abbreviations and Terms

COSA: Clustering Objects on Subsets of Attributes

CPMG NMR: Carr-Purcell-Meiboom-Gill spin echo NMR

DE NMR: Diffusion-edited NMR

LC: Liquid Chromatography

MS/MS: Tandem Mass Spectrometry

MS: Mass Spectrometry

NMR: Nuclear Magnetic Resonance

Protein Nomenclature 
Shotgun sequencing: ame^d;<$^ 
mass*^^^ 
15 ii$^ 

In vthis 'mbde,^^ consists of an initial survey scan 

of peptide peak sigrialsto select the ^e£ ^ 
^MS/MS^searis fbreach^fltte 
Pargeted sequencing. & 
20 mass ^'^tra ^S/M1S) that wej^ acquired. for ^c^^ peptide peafcsV 

Example 6. Systems biology approach: Human cardiovascular disease 
The study in this Example compared plasma from cardiovascular disease patients with plasma
from healthy subjects. In advance of the study, the subject
samples were classified into either diseased or control categories (plasma samples from
cardiovascular disease and matched control subjects). Several metabolomics platforms that use
NMR, LC/MS, and GC/MS technologies and data preprocessing software were used in a
comparative study of 80 plasma samples. The metabolomics profiling platforms generate
datasets containing hundreds of spectral peaks that were initially not identified. Instead, peaks of
statistical significance were determined. These entities were flagged for identification, using
databases, additional MS/MS data, and expert interpretation, in the second phase of the analysis.
Univariate and multivariate statistical analyses of the metabolomics datasets revealed measured
features that were significantly different between the two groups of study subjects. Prior to the
initiation of the second phase of the project, further classification of the diseased subjects on the
basis of a clinical index of disease severity was made and additional statistical analyses were
performed to determine if any measured features correlated with the severity of the cardiovascular
disease in the diseased group. Numerous features showed significance in one or more analyses
and were identified. Then, a correlation network was constructed to visualize statistical and
biological relationships among the identified, significant metabolites.

Objective. The goal of this study was to identify biomarker molecules as molecular
differences between plasma samples taken from cardiovascular disease patients and matched
control subjects.

Study design. The study was executed in two phases:

• Phase I: metabolomics platforms were employed to comparatively profile 80 plasma
samples described as being from either male cardiovascular disease patients (40 samples,
mean age 53.4 years) or age-matched control subjects (40 samples, mean age 51.4
years). The analytical platforms were CPMG NMR, diffusion-edited NMR, GC/MS,
Lipid LC/MS, and Amino acid/global LC/MS. Software algorithms were used to extract
spectral and chromatographic peak information from the raw data. Additional
preprocessing was performed to align the peaks among the datasets from each platform
(i.e., chromatographic retention time alignment for LC/MS and GC/MS) for comparative
statistical analyses. The peaks remained unidentified until flagged for identification on
the basis of statistical significance. Identification activities were initiated on peaks that
had different levels of abundance between the two experimental groups.

• Phase II: Prior to the initiation of the second phase of the project, further classification of
the diseased subjects on the basis of the clinical index of disease severity was made and
additional statistical analyses were performed to determine if any measured features
correlated with the severity of the disease in the diseased group. Where possible, further
identification information was obtained for features deemed significant. A correlation
network was then constructed to visualize statistical and biological relationships among
the identified, significant metabolites.

Summary of methods. A number of analytical methods were used that enable the
comparative profiling of a wide range of metabolites. The samples were analyzed using several
analytical methods, and statistics were performed on unidentified peaks. The methods that were
used are listed and briefly described below.



WO 2005/020125 






PCT/US2004/027022 



(i) CPMG NMR: enhanced NMR measurement of low molecular weight metabolites
at concentrations greater than 100 µM (e.g., amino acids, amino acid metabolites,
organic acids, sugars).

(ii) GC/MS: global method designed for profiling of a wide range of metabolite
classes (e.g., alcohols, aldehydes and cyclohexanols, amino acids, acyl amino
acids, succinylamino acids, amines, aromatic compounds, fatty acids (greater than
C6), organic acids, phospho-organic acids, sugars, sugar acids, sugar amines,
sugar phosphates).

(iii) Lipids LC/MS: optimized for profiling of lipids and non-polar metabolites (e.g.,
lysophospholipids, phospholipids, cholesterol esters, diacylglycerols,
triacylglycerols).

(iv) Amino acids/global LC/MS: optimized for profiling of amino acids and polar
metabolites. Due to the presence of citrate, used as a blood anticoagulant, this
platform did not yield usable data and was not used in Phase II.

(v) Diffusion-edited NMR: enhanced measurement of lipoprotein-associated
metabolites. The profiled peaks are composites of signals from many lipid
moieties and are therefore non-specific. Since uniquely identified molecular
entities were preferred as biomarkers, this method was not pursued in Phase II.

Each of the above analyses yielded raw datasets that contain hundreds to thousands of
peaks per sample. In order to enable comparative analysis of metabolite peak information across
the entire sample set, several algorithms were applied to each raw data file for peak detection and
signal integration. Next, to compensate for minor shifts in peak position that may occur in terms
of retention time for LC/MS and GC/MS techniques or minor differences in chemical shift for
the NMR techniques, algorithms were used to "align" the peaks. As a result of this process, each
metabolite peak within a profile was assigned a peak identification number (or index number).
This same identification number was used to describe the analogous peak found in the profiles
from all other samples and therefore enabled comparative analyses of the integrated peak
intensities.

Following univariate and multivariate statistical analyses of the data from each platform,
metabolites that differentiated the diseased and healthy subjects were listed for identification in
Phase II as ranked by the applied statistics.

Univariate results. Subsequent to data alignment and normalization, univariate
homoscedastic t-tests with controls for false discovery rates were performed on identified
metabolite analytes from all bioanalytical platforms used in the present study. Results showed
twenty-four analytes to differ significantly between disease and
control using the Benjamini-Hochberg approach.
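
For reference, the Benjamini-Hochberg step-up procedure named above can be written in a few lines. The sketch takes p-values as given (for example, from the t-tests); the false discovery rate level `q=0.05` and the function name are assumptions, since the text does not state the level that was used.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices of hypotheses rejected at false discovery rate q.

    Step-up rule: sort the m p-values, find the largest rank k such that
    p_(k) <= k * q / m, and reject the k hypotheses with the smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])


# Toy p-values (not results from the study).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
rejected = benjamini_hochberg(toy_p)
```

Because the rule is step-up, a p-value that fails its own rank threshold can still be rejected when a larger p-value passes a later threshold; the loop therefore keeps the largest passing rank rather than stopping at the first failure.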

Multivariate results. A multianalyte approach to finding sets of spectral peaks capable
of categorizing diseased samples and control samples was also pursued. In the literature, this
problem of finding a biomarker composed of more than one molecular component able to
segregate groups of samples is referred to as a 'classification problem.' In the present case, only
- -those analytes wHchhadbeen^onfide 

four such analytes at me time of the analysis. This number does not include isotopes, adducts, 
redundant 1 NMR resonance peaks, and the ^^bA^»-^ WidenttM. The 
challenge of classification, in brief, is to determine a multianalyte biomarker composed of the
minimal number of most informative analytes.

In considering biomarkers composed of more than one component, a number of points
were considered. These include determining which subset of analytes is the optimum one to
include in the marker; how well the final biomarker performs in correctly classifying the sample
set at hand; and how well the final biomarker performs in classifying samples from an
independent sample set. In addition to the above items, the biochemical relevance of the
components constituting the biomarker is also important, as is the feasibility of developing a
practical diagnostic assay for the final biomarker. With the latter in mind, the minimal optimal
number of analytes which will achieve the best predictive performance is determined.
Figure 53 depicts the outline of the steps of this analysis. In general, multiple data sets from
multiple analytical platforms are concatenated, integrated, and correlated, and then are
normalized. This data is further analyzed through a supervised clustering analysis to obtain a
profile of a biological system. A brief overview of the methodology of constructing a
multianalyte biomarker is presented below.

In order to determine the minimal optimal subset of spectral peaks which best segregate
disease and control samples, a 'Recursive Feature Elimination' approach was used. This
approach proceeds as follows.

1. Choose a 'classification algorithm' which accepts as inputs N components (i.e., N spectral
peaks), and returns (i) the success of segregating control and disease samples (as
measured by specificity and sensitivity) achieved by a linear combination of the N
components, and (ii) a ranking of the N input components based on their contribution to
the classification.

2. Allow all analytes (aligned, normalized and pre-processed) as input to the classification
algorithm.

3. With these components as inputs, run the algorithm to converge upon a linear
combination of input analytes to be used to classify control and disease samples.

4. Record the ranking criterion (the 'weight') of each analyte. The weights are the
coefficients in the linear combination determined by the
algorithm (the final weight is actually a mean weight, averaged over multiple
Cross-Validation iterations).

5. Compute the 'Cross-Validation' performance of this combination of spectral peaks in
classifying control and disease samples using the Cross-Validation method (discussed
below), as well as the standard error for these tests.

6. Remove the analyte with the lowest weight.

7. Repeat Step 3 through Step 6 until only one analyte remains.

8. Determine the minimum number of analytes required to achieve the highest success in
segregating control and disease samples; this biomarker is composed of a linear
combination of analyte values, the coefficients in the combination being the weights
corresponding to each analyte.

The term 'Recursive Feature Elimination' reflects the successive pruning of the list of spectral 
peaks by one spectral peak for each iteration of Steps 3 through 6. 
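The elimination loop of Steps 3 through 7 can be sketched as follows. This is an illustrative sketch only, not the implementation used in the study: the classifier is a plain logistic regression fitted by gradient descent, features are ranked by the absolute value of a single fitted weight (the source averages weights over cross-validation iterations), accuracy is measured on the training data, and all names and data are hypothetical.

```python
import math


def fit_logistic(X, y, epochs=500, lr=0.1):
    """Fit logistic-regression weights and a bias by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def accuracy(X, y, w, b):
    """Fraction of samples whose linear score falls on the correct side of zero."""
    hits = sum(
        (b + sum(wj * xj for wj, xj in zip(w, xi)) > 0) == (yi == 1)
        for xi, yi in zip(X, y)
    )
    return hits / len(y)


def recursive_feature_elimination(X, y, names):
    """Return [(remaining feature names, training accuracy)], one entry per round."""
    active = list(range(len(names)))
    history = []
    while active:
        Xa = [[row[j] for j in active] for row in X]
        w, b = fit_logistic(Xa, y)                      # Step 3: fit linear combination
        history.append(([names[j] for j in active], accuracy(Xa, y, w, b)))
        if len(active) == 1:                            # Step 7: stop at one analyte
            break
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        del active[weakest]                             # Step 6: drop lowest weight
    return history


# Hypothetical toy data: peak_A separates the classes, the other peaks barely do.
names = ["peak_A", "peak_B", "peak_C"]
y = [0, 0, 0, 0, 1, 1, 1, 1]
A = [-1.0, -1.2, -0.8, -1.1, 1.0, 1.1, 0.9, 1.2]
B = [-0.3, 0.2, -0.4, 0.1, 0.3, -0.1, 0.4, 0.2]
C = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
X = [list(row) for row in zip(A, B, C)]
history = recursive_feature_elimination(X, y, names)
for kept, acc in history:
    print(len(kept), kept, round(acc, 2))
```

In a fuller treatment, the accuracy recorded in each round would be replaced by the Cross-Validation performance, and Step 8 would select the smallest feature set that reaches the best recorded performance.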

In the present study, one classification algorithm was applied. This algorithm involves a
state-of-the-art approach referred to as a 'Logistic Classifier' (Anderson, 1982). This method
has its origins in handwriting and biometric pattern recognition. It is designed to select for a
final biomarker comprising components with low mutual correlation, a desirable trait to avoid
redundancy and minimize biomarker size. While the general principles of the technique are
known, the current analysis optimizes it to work with data derived from the particular
bioanalytical profiling platforms discussed earlier.

There are two different tests of performance which have been applied for the processes
outlined in this section.

1. 'Cross-Validation Performance' is the classification success of a biomarker which has
been constructed based on a subset of the available samples, and tested on the remaining
samples, which had been left out. A typical
situation for the present study is to construct a biomarker based only on thirty-four (34)
diseased samples and thirty-four (34) control samples chosen at random, and to test the
performance (classification success) of the resultant biomarker in classifying the
remaining six (6) diseased and six (6) control samples which were excluded. This
process is repeated successively many times, with different sets of randomly chosen 6+6
samples 'left out'. The reported 'Cross-Validation Performance' is the
averaged performance of many such permutations; typically ten cross-validation rounds
are used.

It is important to note that the purpose of Cross-Validation is to assess the
generalizability of a biomarker, within the limitations posed by the availability of a
relatively limited number of independent samples. In the absence of independent
samples from a different population of patients, the Cross-Validation Performance is an
estimation of the performance of the biomarker on an independent test set of samples.
Such an extrapolation is made possible by measuring the performance of the biomarker
on the many permutations and combinations of subsets of the available samples; this
process effectively simulates a situation in which many more samples are available.

2. 'Permutation Performance' is the performance of the multivariate biomarker selection
algorithm when sample labels have been randomly permuted. This occurs over many such
random permutations, and the average performance is reported. A robust classifier, one
which is not overfit to the training set, should yield a permutation performance of
approximately 50% (i.e., chance performance).
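The two performance tests can be sketched in a few lines. The sketch below is illustrative and rests on stated assumptions: the classifier is a stand-in single-feature threshold rule rather than the Logistic Classifier of the study, the data are synthetic, and the function names and the number of label permutations (50) are hypothetical. The 6+6 left-out samples and the ten rounds follow the text.

```python
import random


def train_threshold(xs, ys):
    """Toy one-feature classifier: cut halfway between the two class means."""
    lo = [x for x, y in zip(xs, ys) if y == 0]
    hi = [x for x, y in zip(xs, ys) if y == 1]
    m0, m1 = sum(lo) / len(lo), sum(hi) / len(hi)
    return (m0 + m1) / 2, m1 >= m0  # (cut point, whether class 1 lies above it)


def predict(model, x):
    cut, class1_is_high = model
    return int((x > cut) == class1_is_high)


def holdout_performance(xs, ys, n_out=6, rounds=10, rng=None):
    """Mean accuracy over repeated random splits leaving out n_out samples per class."""
    rng = rng or random.Random(0)
    idx0 = [i for i, y in enumerate(ys) if y == 0]
    idx1 = [i for i, y in enumerate(ys) if y == 1]
    scores = []
    for _ in range(rounds):
        test = set(rng.sample(idx0, n_out) + rng.sample(idx1, n_out))
        train = [i for i in range(len(ys)) if i not in test]
        model = train_threshold([xs[i] for i in train], [ys[i] for i in train])
        scores.append(sum(predict(model, xs[i]) == ys[i] for i in test) / len(test))
    return sum(scores) / rounds


def permutation_performance(xs, ys, n_perm=50, rng=None):
    """Mean hold-out accuracy after randomly shuffling the sample labels."""
    rng = rng or random.Random(1)
    total = 0.0
    for _ in range(n_perm):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        total += holdout_performance(xs, shuffled, rng=rng)
    return total / n_perm


# Synthetic, cleanly separated data: 40 controls and 40 diseased samples.
controls = [0.05 * i for i in range(40)]        # 0.00 .. 1.95
diseased = [3.0 + 0.05 * i for i in range(40)]  # 3.00 .. 4.95
xs = controls + diseased
ys = [0] * 40 + [1] * 40

cv_score = holdout_performance(xs, ys)
perm_score = permutation_performance(xs, ys)
print(f"cross-validation performance: {cv_score:.2f}")
print(f"permutation performance:      {perm_score:.2f}")
```

On real labels the held-out accuracy is high because the feature carries signal; once the labels are shuffled the same procedure should hover near chance, which is the behavior the Permutation Performance test looks for.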

Results and discussion. The results of these classification methods are graphically
shown in Figure 54. A biomarker set of fifteen molecular components was identified as part of a
profile of human cardiovascular disease. These molecular components of the biomarker set
were discovered by using multivariate statistical analysis methods and integration of a plurality
of datasets including those for more than one type of measurement technique and those for more
than one biomolecular component type as shown in Figure 56. This methodological approach
was used successfully to generate a biomarker set which could classify the 80 samples. Figure
55 shows the classification of each subject as a disease or control group member using these
biomarkers. A sensitivity of 93% and a specificity of 94% were obtained.
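For reference, sensitivity and specificity are computed from classification counts as sketched below. The counts in the example are hypothetical and chosen only to illustrate the arithmetic on 40 diseased and 40 control samples; they are not the per-sample results of the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = disease, 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of diseased samples correctly called
    specificity = tn / (tn + fp)  # fraction of control samples correctly called
    return sensitivity, specificity


# Hypothetical outcome: 37 of 40 diseased and 38 of 40 controls classified correctly.
y_true = [1] * 40 + [0] * 40
y_pred = [1] * 37 + [0] * 3 + [0] * 38 + [1] * 2
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```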
The abbreviations used in this example are, where appropriate, the same as those used in
Example 5.

Each of the patent documents and scientific publications disclosed hereinabove is 
incorporated by reference herein for all purposes. 

Although the invention has been particularly shown and described with reference to
specific embodiments, it should be understood by those skilled in the art that various changes in
form and detail may be made therein without departing from the spirit, essential characteristics,
or scope of the invention. The foregoing embodiments are therefore to be considered in all
respects illustrative rather than limiting on the invention described herein. The scope of the
invention is thus indicated by the appended claims rather than by the foregoing description, and
all changes which come within the meaning and range of equivalency of the claims are therefore
intended to be embraced therein.
What is claimed is:

1. A method of profiling a state of a biological system in a mammal, the method
comprising the steps of:
(a) evaluating with statistical analysis a plurality of data sets of a biological system
and comparing features among the plurality of data sets to determine one or more sets of
differences among at least a portion of the plurality of data sets; and,
(b) developing a profile for a state of the biological system based on the results of
step (a),
wherein the plurality of data sets comprise measurements derived from more than one
biological sample type, more than one type of measurement technique, more than one
biomolecular component type, or a combination of at least two of a biological sample type, a
measurement technique, and a biomolecular component type.

2. The method of claim 1 wherein the biological system is in a human.

3. The method of claim 1 wherein the statistical analysis comprises multivariate
analysis.

4. The method of claim 1 wherein the biological sample type is selected from the
group comprising blood, plasma, serum, cerebrospinal fluid, bile, saliva, synovial fluid, pleural
fluid, pericardial fluid, peritoneal fluid, sweat, feces, nasal fluid, ocular fluid, intracellular fluid,
intercellular fluid, lymph, urine, liver cells, epithelial cells, endothelial cells, kidney cells,
prostate cells, blood cells, lung cells, brain cells, skin cells, adipose cells, tumor cells, and
mammary cells.

5. The method of claim 1 wherein a plurality of data sets are derived from one
biological sample type that is treated differently, or from one biological sample type that is
collected or analyzed at different times.

6. The method of claim 1 wherein the measurement technique is selected from the
group comprising liquid chromatography, gas chromatography, high performance liquid
chromatography, capillary electrophoresis, mass spectrometry, liquid chromatography-mass
spectrometry, gas chromatography-mass spectrometry, high performance liquid chromatography-mass
spectrometry, capillary electrophoresis-mass spectrometry, nuclear magnetic resonance
spectrometry, parallel hybridization assay, parallel sandwich assay, and competitive assay.

7. The method of claim 1 wherein a plurality of data sets comprise measurements
from different instrument configurations of a single type of measurement technique.

8. The method of claim 1 wherein the biomolecular component type is a gene, a
gene transcript, a protein, or a metabolite.

9. The method of claim 1 comprising the step of comparing the profile for a state of
the biological system to a database of profiles.

10. The method of claim 1 comprising comparing the profile for a state of the
biological system to a profile of another state of a biological system.

11. An article of manufacture having [illegible]
computer-readable instructions embodied thereon for performing the method of claim 1.

12. A method of profiling a state of a biological system in a mammal, the method
comprising the steps of:
(a) evaluating with statistical analysis a plurality of data sets for a biomolecular
component type and comparing features among the plurality of data sets to determine one or
more sets of differences among at least a portion of the plurality of data sets;
(b) evaluating with statistical analysis a plurality of data sets for another biomolecular
component type and comparing features among the plurality of data sets to determine one or
more sets of differences among at least a portion of the plurality of data sets; and
(c) correlating the results of step (a) and step (b) to develop a profile for a state of the
biological system.

13. The method of claim 12 wherein the plurality of data sets for a biomolecular
component type or another biomolecular component type comprise measurements derived from
more than one biological sample type, more than one type of measurement technique, or a
combination of a biological sample type and a measurement technique.

14. The method of claim 12 wherein the biomolecular component type is a protein
and the other biomolecular component type is a metabolite.

15. The method of claim 12 wherein the biomolecular component type is a gene
transcript and the other biomolecular component type is a metabolite.

16. A method of profiling a state of a biological system in a mammal, the method
comprising the steps of:
(a) evaluating with statistical analysis a plurality of data sets comprising
measurements from at least two biomolecular component types and comparing features among
the plurality of data sets to determine one or more sets of differences among at least a portion of
the plurality of data sets; and
(b) developing a profile for a state of the biological system based on the results of
step (a).

17. The method of claim 16 wherein the plurality of data sets comprise measurements
derived from more than one biological sample type, more than one type of measurement
technique, or a combination of a biological sample type and a measurement technique.

18. The method of claim 16 wherein the step of evaluating comprises:
evaluating a plurality of data sets for a biomolecular component type and comparing
features among the plurality of data sets to determine one or more sets of differences among at
least a portion of the plurality of data sets; and
evaluating a plurality of data sets for another biomolecular component type and
comparing features among the plurality of data sets to determine one or more sets of differences
among at least a portion of the plurality of data sets.

19. The method of claim 16 wherein the at least two biomolecular component types
comprise a protein and a metabolite.

20. The method of claim 16 wherein the at least two biomolecular component types
comprise a gene transcript and a metabolite.
Figure 32B
Figure 34
Figure 38
Figure 44A
Figure 44B
Figure 46
Figure 47
O = Control Human
□ = Control Rat
■ = Diseased Rat



WO 2005/020125 PCT/US2004/027022

Figure 51



Figure 52 (image not reproduced). Node labels are lipid species, e.g., C20:4 lipid, C20:3 lipid,
C54:7 lipid, and C54:3 lipid; most other labels are not legible. Legend: inner circle = diseased
humans vs. control humans; outer circle = diseased rodents vs. control rodents; symbols mark
lipid lower in disease and lipid higher in disease; line styles mark negative correlation and
positive correlation.



Figure 53 (image not reproduced; outline of the analysis steps, including supervised clustering;
remaining text not recoverable).



Figure 54 (image not reproduced). Vertical axis: Classification success (%).



Figure 55. 15-Analyte Biomarker Performance on 80 Samples (image not reproduced).
Legend: Control (N=40); Diseased (N=40). Specificity = 93%; Sensitivity = 94%.
Horizontal axis: Sample Index.



Figure 56. Fifteen Biomarker Analytes

Analyte         Weight in Biomarker (arb. units)    Platform
Lipid 1         0.42                                Lipid LC-MS
Lipid 2         0.33                                Lipid LC-MS
Metabolite 1    0.31                                GC-MS
Metabolite 2    0.30                                NMR
Metabolite 3    0.30                                GC-MS
Metabolite 4    0.25                                GC-MS
Lipid 3         0.24                                Lipid LC-MS
Metabolite 5    0.23                                GC-MS
Lipid 4         0.21                                Lipid LC-MS
Metabolite 6    0.20                                GC-MS
Metabolite 7    0.18                                NMR
Lipid 5         0.18                                Lipid LC-MS
Lipid 6         0.17                                Lipid LC-MS
Lipid 7         0.04                                Lipid LC-MS
Lipid 8         0.01                                Lipid LC-MS


